Text::Balanced(3pm)

Text::Balanced(3)     User Contributed Perl Documentation    Text::Balanced(3)

NAME

       Text::Balanced - Extract delimited text sequences from strings.

SYNOPSIS

           use Text::Balanced qw (
               extract_delimited
               extract_bracketed
               extract_quotelike
               extract_codeblock
               extract_variable
               extract_tagged
               extract_multiple
               gen_delimited_pat
               gen_extract_tagged
           );

           # Extract the initial substring of $text that is delimited by
           # two (unescaped) instances of the first character in $delim.

           ($extracted, $remainder) = extract_delimited($text,$delim);

           # Extract the initial substring of $text that is bracketed
           # with the delimiter(s) specified by $delim (where the string
           # in $delim contains one or more of '(){}[]<>').

           ($extracted, $remainder) = extract_bracketed($text,$delim);

           # Extract the initial substring of $text that is bounded by
           # an XML tag.

           ($extracted, $remainder) = extract_tagged($text);

           # Extract the initial substring of $text that is bounded by
           # a BEGIN...END pair. Don't allow nested BEGIN tags.

           ($extracted, $remainder) =
               extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});

           # Extract the initial substring of $text that represents a
           # Perl "quote or quote-like operation"

           ($extracted, $remainder) = extract_quotelike($text);

           # Extract the initial substring of $text that represents a block
           # of Perl code, bracketed by any of the character(s) specified by
           # $delim (where the string $delim contains one or more of '(){}[]<>').

           ($extracted, $remainder) = extract_codeblock($text,$delim);

           # Extract the initial substrings of $text that would be extracted by
           # one or more sequential applications of the specified functions
           # or regular expressions

           @extracted = extract_multiple($text,
                                         [ \&extract_bracketed,
                                           \&extract_quotelike,
                                           \&some_other_extractor_sub,
                                           qr/[xyz]*/,
                                           'literal',
                                         ]);

           # Create a string representing an optimized pattern (a la Friedl)
           # that matches a substring delimited by any of the specified characters
           # (in this case: any type of quote or a slash)

           $patstring = gen_delimited_pat(q{'"`/});

           # Generate a reference to an anonymous sub that is just like extract_tagged
           # but pre-compiled and optimized for a specific pair of tags, and
           # consequently faster (typically two to three times faster). It uses
           # qr// for better performance on repeated calls.

           $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
           ($extracted, $remainder) = $extract_head->($text);

DESCRIPTION

       The various "extract_..." subroutines may be used to extract a
       delimited substring, possibly after skipping a specified prefix string.
       By default, that prefix is optional whitespace ("/\s*/"), but you can
       change it to whatever you wish (see below).

       The substring to be extracted must appear at the current "pos" location
       of the string's variable (or at index zero, if no "pos" position is
       defined).  In other words, the "extract_..." subroutines don't extract
       the first occurrence of a substring anywhere in a string (like an
       unanchored regex would). Rather, they extract an occurrence of the
       substring appearing immediately at the current matching position in the
       string (like a "\G"-anchored regex would).
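
       As a minimal sketch of this anchoring (the sample string and variable
       names are illustrative assumptions, not part of the module), the
       following shows an extraction failing at index zero and then
       succeeding once "pos" points just before the bracketed text:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "foo (bar) baz";

# No pos() is set, so matching starts at index 0. "foo" is not a
# bracketed substring, so this extraction fails.
my ($first_try) = extract_bracketed($text, '()');

# Move the matching position to just after "foo" and try again.
pos($text) = 3;
my ($extracted, $remainder, $prefix) = extract_bracketed($text, '()');

print defined $first_try ? "matched at index 0\n" : "no match at index 0\n";
print "extracted: $extracted\n";
```

       The second call skips the single space as its whitespace prefix and
       then extracts the parenthesized text.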

   General Behaviour in List Contexts
       In a list context, all the subroutines return a list, the first three
       elements of which are always:

       [0] The extracted string, including the specified delimiters.  If the
           extraction fails "undef" is returned.

       [1] The remainder of the input string (i.e. the characters after the
           extracted string). On failure, the entire string is returned.

       [2] The skipped prefix (i.e. the characters before the extracted
           string).  On failure, "undef" is returned.

       Note that in a list context, the contents of the original input text
       (the first argument) are not modified in any way.

       However, if the input text was passed in a variable, that variable's
       "pos" value is updated to point at the first character after the
       extracted text. That means that in a list context the various
       subroutines can be used much like regular expressions. For example:

           while ( $next = (extract_quotelike($text))[0] )
           {
               # process next quote-like (in $next)
           }

   General Behaviour in Scalar and Void Contexts
       In a scalar context, the extracted string is returned, having first
       been removed from the input text. Thus, the following code also
       processes each quote-like operation, but actually removes them from
       $text:

           while ( $next = extract_quotelike($text) )
           {
               # process next quote-like (in $next)
           }

       Note that if the input text is a read-only string (i.e. a literal), no
       attempt is made to remove the extracted text.

       In a void context the behaviour of the extraction subroutines is
       exactly the same as in a scalar context, except (of course) that the
       extracted substring is not returned.
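
       The difference between the two calling contexts can be sketched as
       follows. The sample strings and variable names are assumptions made
       for illustration only:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# List context: the input variable keeps its contents; only pos() moves.
my $list_text = q{'a' "b" rest};
my ($list_match) = extract_quotelike($list_text);

# Scalar context: the match (and any skipped prefix) is removed from the input.
my $scalar_text = q{'a' "b" rest};
my $scalar_match = extract_quotelike($scalar_text);

print "list context left:   [$list_text]\n";
print "scalar context left: [$scalar_text]\n";
```

       After these calls, $list_text is unchanged but its "pos" has advanced
       past the first quoted string, whereas $scalar_text has had that quoted
       string removed.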

   A Note About Prefixes
       Prefix patterns are matched without any trailing modifiers ("/gimsox"
       etc.).  This can bite you if you're expecting a prefix specification
       like '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a
       prefix pattern will only succeed if the <H1> tag is on the current
       line, since "." normally doesn't match newlines.

       To overcome this limitation, you need to turn on /s matching within the
       prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)'
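
       A short sketch of the two prefix forms, using "extract_tagged" with an
       explicit <H1>...</H1> tag pair. The input string is a made-up example:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "intro\nmore\n<H1>Title</H1> tail";

# "." cannot cross the newlines, so this prefix never reaches <H1>
# and the extraction fails.
my ($without_s) = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# With (?s), "." also matches newlines, so the prefix is skipped.
my ($with_s, $remainder, $prefix) =
    extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print defined $without_s ? "matched\n" : "no match without (?s)\n";
print "with (?s): $with_s\n";
```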

   Functions
       "extract_delimited"
           The "extract_delimited" function formalizes the common idiom of
           extracting a single-character-delimited substring from the start of
           a string. For example, to extract a single-quote delimited string,
           the following code is typically used:

               ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
               $extracted = $1;

           but with "extract_delimited" it can be simplified to:

               ($extracted,$remainder) = extract_delimited($text, "'");

           "extract_delimited" takes up to four scalars (the input text, the
           delimiters, a prefix pattern to be skipped, and any escape
           characters) and extracts the initial substring of the text that is
           appropriately delimited. If the delimiter string has multiple
           characters, the first one encountered in the text is taken to
           delimit the substring.  The third argument specifies a prefix
           pattern that is to be skipped (but must be present!) before the
           substring is extracted.  The final argument specifies the escape
           character to be used for each delimiter.

           All arguments are optional. If the escape characters are not
           specified, every delimiter is escaped with a backslash ("\").  If
           the prefix is not specified, the pattern '\s*' - optional
           whitespace - is used. If the delimiter set is also not specified,
           the set "/["'`]/" is used. If the text to be processed is not
           specified either, $_ is used.

           In list context, "extract_delimited" returns a list of three
           elements: the extracted substring (including the surrounding
           delimiters), the remainder of the text, and the skipped prefix (if
           any). If a suitable delimited substring is not found, the first
           element is "undef" (as described under "General Behaviour in List
           Contexts"), the second is the complete original text, and the
           prefix returned in the third element is an empty string.

           In a scalar context, just the extracted substring is returned. In a
           void context, the extracted substring (and any prefix) are simply
           removed from the beginning of the first argument.

           Examples:

               # Remove a single-quoted substring from the very beginning of $text:

                   $substring = extract_delimited($text, "'", '');

               # Remove a single-quoted Pascalish substring (i.e. one in which
               # doubling the quote character escapes it) from the very
               # beginning of $text:

                   $substring = extract_delimited($text, "'", '', "'");

               # Extract a single- or double- quoted substring from the
               # beginning of $text, optionally after some whitespace
               # (note the list context to protect $text from modification):

                   ($substring) = extract_delimited $text, q{"'};

               # Delete the substring delimited by the first '/' in $text:

                   $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];

           Note that this last example is not the same as deleting the first
           quote-like pattern. For instance, if $text contained the string:

               "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"

           then after the deletion it would contain:

               "if ('.$UNIXCMD/s) { $cmd = $1; }"

           not:

               "if ('./cmd' =~ ms) { $cmd = $1; }"

           See "extract_quotelike" for a (partial) solution to this problem.
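
           The two escaping styles described above can be sketched as
           follows. The sample strings and variable names are illustrative
           assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Backslash-escaped delimiter (the default escape character),
# preceded by whitespace that the default prefix skips.
my $c_style = q{  'it\'s here' and more};
my ($c_match, $c_rest, $c_prefix) = extract_delimited($c_style, q{'});

# Pascal-style escaping: a doubled quote stands for a literal quote.
# The empty prefix means the quote must be the very first character.
my $pascal = q{'it''s here' and more};
my ($p_match, $p_rest) = extract_delimited($pascal, q{'}, '', q{'});

print "backslash style: $c_match\n";
print "pascal style:    $p_match\n";
```

           Both calls are made in list context, so neither input variable is
           modified.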

       "extract_bracketed"
           Like "extract_delimited", the "extract_bracketed" function takes up
           to three optional scalar arguments: a string to extract from, a
           delimiter specifier, and a prefix pattern. As before, a missing
           prefix defaults to optional whitespace and a missing text defaults
           to $_. However, a missing delimiter specifier defaults to
           '{}()[]<>' (see below).

           "extract_bracketed" extracts a balanced-bracket-delimited substring
           (using any one (or more) of the user-specified delimiter brackets:
           '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect
           quoted unbalanced brackets (see below).

           A "delimiter bracket" is a bracket in the list of delimiters passed
           as "extract_bracketed"'s second argument. Delimiter brackets are
           specified by giving either the left or right (or both!) versions of
           the required bracket(s). Note that the order in which two or more
           delimiter brackets are specified is not significant.

           A "balanced-bracket-delimited substring" is a substring bounded by
           matched brackets, such that any other (left or right) delimiter
           bracket within the substring is also matched by an opposite (right
           or left) delimiter bracket at the same level of nesting. Any type
           of bracket not in the delimiter list is treated as an ordinary
           character.

           In other words, each type of bracket specified as a delimiter must
           be balanced and correctly nested within the substring, and any
           other kind of ("non-delimiter") bracket in the substring is
           ignored.

           For example, given the string:

               $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";

           then a call to "extract_bracketed" in a list context:

               @result = extract_bracketed( $text, '{}' );

           would return:

               ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )

           since both sets of '{..}' brackets are properly nested and evenly
           balanced.  (In a scalar context just the first element of the array
           would be returned. In a void context, $text would be replaced by an
           empty string.)

           Likewise the call in:

               @result = extract_bracketed( $text, '{[' );

           would return the same result, since all sets of both types of
           specified delimiter brackets are correctly nested and balanced.

           However, the call in:

               @result = extract_bracketed( $text, '{([<' );

           would fail, returning:

               ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }"  )

           because the embedded pairs of '(..)'s and '[..]'s are "cross-
           nested" and the embedded '>' is unbalanced. (In a scalar context,
           this call would return an empty string. In a void context, $text
           would be unchanged.)

           Note that the embedded single-quotes in the string don't help in
           this case, since they have not been specified as acceptable
           delimiters and are therefore treated as non-delimiter characters
           (and ignored).

           However, if a particular species of quote character is included in
           the delimiter specification, then that type of quote will be
           correctly handled.  For example, if $text is:

               $text = '<A HREF=">>>>">link</A>';

           then

               @result = extract_bracketed( $text, '<">' );

           returns:

               ( '<A HREF=">>>>">', 'link</A>', "" )

           as expected. Without the specification of """ as an embedded
           quoter:

               @result = extract_bracketed( $text, '<>' );

           the result would be:

               ( '<A HREF=">', '>>>">link</A>', "" )

           In addition to the quote delimiters "'", """, and "`", full Perl
           quote-like quoting (i.e. q{string}, qq{string}, etc) can be
           specified by including the letter 'q' as a delimiter. Hence:

               @result = extract_bracketed( $text, '<q>' );

           would correctly match something like this:

               $text = '<leftop: conj /and/ conj>';

           See also: "extract_quotelike" and "extract_codeblock".
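
           The examples above can be combined into one runnable sketch. The
           variable names are illustrative. Note the explicit reset of "pos"
           between calls on the same variable, which is needed because each
           list-context call advances the matching position:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = '<A HREF=">>>>">link</A>';

# Including '"' in the delimiter list makes double-quoted text
# opaque to bracket counting.
my @quoted = extract_bracketed($text, '<">');

# The previous list-context call advanced pos($text); start over.
pos($text) = 0;
my @plain = extract_bracketed($text, '<>');

# Cross-nested brackets cause a failure when both kinds are delimiters.
my $mixed = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
my @failed = extract_bracketed($mixed, '{([<');

print "with quote handling:    $quoted[0]\n";
print "without quote handling: $plain[0]\n";
print "cross-nested input:     ", (defined $failed[0] ? "matched" : "failed"), "\n";
```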

       "extract_variable"
           "extract_variable" extracts any valid Perl variable or variable-
           involved expression, including scalars, arrays, hashes, array
           accesses, hash look-ups, method calls through objects, subroutine
           calls through subroutine references, etc.

           The subroutine takes up to two optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

           2.  A string specifying a pattern to be matched as a prefix (which
               is to be skipped). If omitted, optional whitespace is skipped.

           On success in a list context, an array of 3 elements is returned.
           The elements are:

           [0] the extracted variable, or variablish expression

           [1] the remainder of the input text,

           [2] the prefix substring (if any),

           On failure, all of these values (except the remaining text) are
           "undef".

           In a scalar context, "extract_variable" returns just the complete
           substring that matched a variablish expression. "undef" is returned
           on failure. In addition, the original input text has the returned
           substring (and any prefix) removed from it.

           In a void context, the input text just has the matched substring
           (and any specified prefix) removed.
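
           A minimal sketch of a list-context call. The sample code string
           is an assumption chosen for illustration; it is single-quoted so
           that Perl does not interpolate the variable it contains:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

# q{} keeps the sample source text from being interpolated.
my $code = q{  $config{name} . "suffix"};

my ($variable, $remainder, $prefix) = extract_variable($code);

print "variable:  $variable\n";
print "remainder: $remainder\n";
```

           Here the leading whitespace is skipped as the default prefix and
           the hash look-up is extracted as a single variablish expression.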

       "extract_tagged"
           "extract_tagged" extracts and segments text between (balanced)
           specified tags.

           The subroutine takes up to five optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

           2.  A string specifying a pattern to be matched as the opening tag.
               If the pattern string is omitted (or "undef") then a pattern
               that matches any standard XML tag is used.

           3.  A string specifying a pattern to be matched at the closing tag.
               If the pattern string is omitted (or "undef") then the closing
               tag is constructed by inserting a "/" after any leading bracket
               characters in the actual opening tag that was matched (not the
               pattern that matched the tag). For example, if the opening tag
               pattern is specified as '{{\w+}}' and actually matched the
               opening tag "{{DATA}}", then the constructed closing tag would
               be "{{/DATA}}".

           4.  A string specifying a pattern to be matched as a prefix (which
               is to be skipped). If omitted, optional whitespace is skipped.

           5.  A hash reference containing various parsing options (see below)

           The various options that can be specified are:

           "reject => $listref"
               The list reference contains one or more strings specifying
               patterns that must not appear within the tagged text.

               For example, to extract an HTML link (which should not contain
               nested links) use:

                       extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );

           "ignore => $listref"
               The list reference contains one or more strings specifying
               patterns that are not to be treated as nested tags within the
               tagged text (even if they would match the start tag pattern).

               For example, to extract an arbitrary XML tag, but ignore
               "empty" elements:

                       extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );

               (also see "gen_delimited_pat" below).

           "fail => $str"
               The "fail" option indicates the action to be taken if a
               matching end tag is not encountered (i.e. before the end of the
               string or some "reject" pattern matches). By default, a failure
               to match a closing tag causes "extract_tagged" to immediately
               fail.

               However, if the string value associated with "fail" is "MAX",
               then "extract_tagged" returns the complete text up to the point
               of failure.  If the string is "PARA", "extract_tagged" returns
               only the first paragraph after the tag (up to the first line
               that is either empty or contains only whitespace characters).
               If the string is "", the default behaviour (i.e. failure) is
               reinstated.

               For example, suppose the start tag "/para" introduces a
               paragraph, which then continues until the next "/endpara" tag
               or until another "/para" tag is encountered:

                       $text = "/para line 1\n\nline 3\n/para line 4";

                       extract_tagged($text, '/para', '/endpara', undef,
                                      {reject => ['/para'], fail => 'MAX'} );

                       # EXTRACTED: "/para line 1\n\nline 3\n"

               Suppose instead, that if no matching "/endpara" tag is found,
               the "/para" tag refers only to the immediately following
               paragraph:

                       $text = "/para line 1\n\nline 3\n/para line 4";

                       extract_tagged($text, '/para', '/endpara', undef,
                                      {reject => ['/para'], fail => 'PARA'} );

                       # EXTRACTED: "/para line 1\n"

               Note that the specified "fail" behaviour applies to nested tags
               as well.

           On success in a list context, an array of 6 elements is returned.
           The elements are:

           [0] the extracted tagged substring (including the outermost tags),

           [1] the remainder of the input text,

           [2] the prefix substring (if any),

           [3] the opening tag

           [4] the text between the opening and closing tags

           [5] the closing tag (or "" if no closing tag was found)

           On failure, all of these values (except the remaining text) are
           "undef".

           In a scalar context, "extract_tagged" returns just the complete
           substring that matched a tagged text (including the start and end
           tags). "undef" is returned on failure. In addition, the original
           input text has the returned substring (and any prefix) removed from
           it.

           In a void context, the input text just has the matched substring
           (and any specified prefix) removed.
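
           The following sketch exercises the six-element return list and
           the "reject" and "fail" options together. The input strings and
           variable names are assumptions made for illustration:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Default tag patterns: any XML-style opening tag, with the closing
# tag derived from the opening tag that actually matched.
my $markup = "<b>bold <i>nested</i></b> trailing";
my ($whole, $rest, $prefix, $open, $inner, $close) = extract_tagged($markup);

# With "reject", a nested <A> makes the whole extraction fail.
my $links = "<A>one <A>two</A></A>";
my ($link) = extract_tagged($links, '<A>', '</A>', undef,
                            { reject => ['<A>'] });

# With fail => 'MAX', the text up to the rejected tag is returned instead.
my $paras = "/para line 1\n\nline 3\n/para line 4";
my ($para) = extract_tagged($paras, '/para', '/endpara', undef,
                            { reject => ['/para'], fail => 'MAX' });

print "open=$open inner=$inner close=$close\n";
print "nested link: ", (defined $link ? "extracted" : "rejected"), "\n";
```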

       "gen_extract_tagged"
           "gen_extract_tagged" generates a new anonymous subroutine which
           extracts text between (balanced) specified tags. In other words, it
           generates a function identical in behaviour to "extract_tagged".

           The difference between "extract_tagged" and the anonymous
           subroutines generated by "gen_extract_tagged", is that those
           generated subroutines:

           ·   do not have to reparse tag specification or parsing options
               every time they are called (whereas "extract_tagged" has to
               effectively rebuild its tag parser on every call);

           ·   make use of the qr// construct to pre-compile the regexes
               they use (whereas "extract_tagged" uses standard string
               variable interpolation to create tag-matching patterns).

           The subroutine takes up to four optional arguments (the same set as
           "extract_tagged" except for the string to be processed). It returns
           a reference to a subroutine which in turn takes a single argument
           (the text to be extracted from).

           In other words, the implementation of "extract_tagged" is exactly
           equivalent to:

                   sub extract_tagged
                   {
                           my $text = shift;
                           my $extractor = gen_extract_tagged(@_);
                           return $extractor->($text);
                   }

           (although "extract_tagged" is not currently implemented that way).

           Using "gen_extract_tagged" to create extraction functions for
           specific tags is a good idea if those functions are going to be
           called more than once, since their performance is typically twice
           as good as the more general-purpose "extract_tagged".
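
           A sketch of the intended usage pattern: build the extractor once,
           then reuse it. The sample page strings are made up for
           illustration:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once...
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

# ...then reuse it for every document.
my @pages = (
    "<HEAD><TITLE>One</TITLE></HEAD><BODY>1</BODY>",
    "  <HEAD><TITLE>Two</TITLE></HEAD><BODY>2</BODY>",
);

my @heads;
for my $page (@pages) {
    # List context: $page itself is left unmodified.
    my ($head) = $extract_head->($page);
    push @heads, $head;
}

print "$_\n" for @heads;
```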

       "extract_quotelike"
           "extract_quotelike" attempts to recognize, extract, and segment any
           one of the various Perl quotes and quotelike operators (see
           perlop(3)). Nested backslashed delimiters, embedded balanced bracket
           delimiters (for the quotelike operators), and trailing modifiers
           are all caught. For example, in:

                   extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'

                   extract_quotelike '  "You said, \"Use sed\"."  '

                   extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '

                   extract_quotelike ' tr/\\\/\\\\/\\\//ds; '

           the full Perl quotelike operations are all extracted correctly.

           Note too that, when using the /x modifier on a regex, any comment
           containing the current pattern delimiter will cause the regex to be
           immediately terminated. In other words:

                   'm /
                           (?i)            # CASE INSENSITIVE
                           [a-z_]          # LEADING ALPHABETIC/UNDERSCORE
                           [a-z0-9]*       # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
                      /x'

           will be extracted as if it were:

                   'm /
                           (?i)            # CASE INSENSITIVE
                           [a-z_]          # LEADING ALPHABETIC/'

           This behaviour is identical to that of the actual compiler.

           "extract_quotelike" takes two arguments: the text to be processed
           and a prefix to be matched at the very beginning of the text. If no
           prefix is specified, optional whitespace is the default. If no text
           is given, $_ is used.

           In a list context, an array of 11 elements is returned. The
           elements are:

           [0] the extracted quotelike substring (including trailing
               modifiers),

           [1] the remainder of the input text,

           [2] the prefix substring (if any),

           [3] the name of the quotelike operator (if any),

           [4] the left delimiter of the first block of the operation,

           [5] the text of the first block of the operation (that is, the
               contents of a quote, the regex of a match or substitution or
               the target list of a translation),

           [6] the right delimiter of the first block of the operation,

           [7] the left delimiter of the second block of the operation (that
               is, if it is a "s", "tr", or "y"),

           [8] the text of the second block of the operation (that is, the
               replacement of a substitution or the translation list of a
               translation),

           [9] the right delimiter of the second block of the operation (if
               any),

           [10]
               the trailing modifiers on the operation (if any).

           For each of the fields marked "(if any)" the default value on
           success is an empty string.  On failure, all of these values
           (except the remaining text) are "undef".

           In a scalar context, "extract_quotelike" returns just the complete
           substring that matched a quotelike operation (or "undef" on
           failure). In a scalar or void context, the input text has the same
           substring (and any specified prefix) removed.

           Examples:

                   # Remove the first quotelike literal that appears in text

                           $quotelike = extract_quotelike($text,'.*?');

                   # Replace one or more leading whitespace-separated quotelike
                   # literals in $_ with "<QLL>"

                           do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;


                   # Isolate the search pattern in a quotelike operation from $text

                           ($op,$pat) = (extract_quotelike $text)[3,5];
                           if ($op =~ /[ms]/)
                           {
                                   print "search pattern: $pat\n";
                           }
                           else
                           {
                                   print "$op is not a pattern matching operation\n";
                           }
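
           As a small sketch of how the numbered fields map onto a
           substitution, the following picks a few of them out of the
           returned list. The sample code string and variable names are
           assumptions for illustration:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# q{} keeps the sample source text literal.
my $code = q{s/foo/bar/gi; next};

my @field = extract_quotelike($code);

my ($whole, $rest)          = @field[0, 1];   # full operation, leftover text
my ($operator, $flags)      = @field[3, 10];  # operator name, trailing modifiers
my ($pattern, $replacement) = @field[5, 8];   # first and second blocks

print "operator:    $operator\n";
print "pattern:     $pattern\n";
print "replacement: $replacement\n";
print "modifiers:   $flags\n";
```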

       "extract_quotelike" and "here documents"
           "extract_quotelike" can successfully extract "here documents" from
           an input string, but with an important caveat in list contexts.

           Unlike other types of quote-like literals, a here document is
           rarely a contiguous substring. For example, a typical piece of code
           using a here document might look like this:

                   <<'EOMSG' || die;
                   This is the message.
                   EOMSG
                   exit;

           Given this as an input string in a scalar context,
           "extract_quotelike" would correctly return the string
           "<<'EOMSG'\nThis is the message.\nEOMSG", leaving the string " ||
           die;\nexit;" in the original variable. In other words, the two
           separate pieces of the here document are successfully extracted and
           concatenated.

           In a list context, "extract_quotelike" would return the list

           [0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full
               extracted here document, including fore and aft delimiters),

           [1] " || die;\nexit;" (i.e. the remainder of the input text,
               concatenated),

           [2] "" (i.e. the prefix substring -- trivial in this case),

           [3] "<<" (i.e. the "name" of the quotelike operator)

           [4] "'EOMSG'" (i.e. the left delimiter of the here document,
               including any quotes),

           [5] "This is the message.\n" (i.e. the text of the here document),

           [6] "EOMSG" (i.e. the right delimiter of the here document),

           [7..10]
               "" (a here document has no second left delimiter, second text,
               second right delimiter, or trailing modifiers).

           However, the matching position of the input variable would be set
           to "exit;" (i.e. after the closing delimiter of the here document),
           which would cause the earlier " || die;\nexit;" to be skipped in
           any sequence of code fragment extractions.

           To avoid this problem, when it encounters a here document whilst
           extracting from a modifiable string, "extract_quotelike" silently
           rearranges the string to an equivalent piece of Perl:

                   <<'EOMSG'
                   This is the message.
                   EOMSG
                   || die;
                   exit;

           in which the here document is contiguous. It still leaves the
           matching position after the here document, but now the rest of the
           line on which the here document starts is not skipped.

           To prevent "extract_quotelike" from mucking about with the input in
           this way (this is the only case where a list-context
           "extract_quotelike" does so), you can pass the input variable as an
           interpolated literal:

                   $quotelike = extract_quotelike("$var");
700
701       "extract_codeblock"
702           "extract_codeblock" attempts to recognize and extract a balanced
703           bracket delimited substring that may contain unbalanced brackets
704           inside Perl quotes or quotelike operations. That is,
705           "extract_codeblock" is like a combination of "extract_bracketed"
706           and "extract_quotelike".
707
708           "extract_codeblock" takes the same initial three parameters as
709           "extract_bracketed": a text to process, a set of delimiter brackets
710           to look for, and a prefix to match first. It also takes an optional
711           fourth parameter, which allows the outermost delimiter brackets to
712           be specified separately (see below).
713
714           Omitting the first argument (input text) means process $_ instead.
715           Omitting the second argument (delimiter brackets) indicates that
716           only '{' is to be used.  Omitting the third argument (prefix
717           argument) implies optional whitespace at the start.  Omitting the
718           fourth argument (outermost delimiter brackets) indicates that the
719           value of the second argument is to be used for the outermost
720           delimiters.
721
722           Once the prefix and the outermost opening delimiter bracket have
723           been recognized, code blocks are extracted by stepping through the
724           input text and trying the following alternatives in sequence:
725
726           1.  Try to match a closing delimiter bracket. If the bracket was
727               the same species as the last opening bracket, return the
728               substring to that point. If the bracket was mismatched, return
729               an error.
730
731           2.  Try to match a quote or quotelike operator. If found, call
732               "extract_quotelike" to eat it. If "extract_quotelike" fails,
733               return the error it returned. Otherwise go back to step 1.
734
735           3.  Try to match an opening delimiter bracket. If found, call
736               "extract_codeblock" recursively to eat the embedded block. If
737               the recursive call fails, return an error. Otherwise, go back
738               to step 1.
739
740           4.  Unconditionally match a bareword or any other single character,
741               and then go back to step 1.
742
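              As a minimal sketch of the stepping process above (the sample
              text and variable names are illustrative assumptions, not part
              of the module), the following shows a closing brace inside a
              quoted string being consumed as a quote (step 2) rather than
              ending the block (step 1):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Hypothetical input: the "}" inside the double-quoted string must not
# be mistaken for the end of the block.
my $text = '{ print "}"; } and the rest';

# List context: returns the extracted block and the remainder, and
# leaves $text itself unmodified.
my ($block, $remainder) = extract_codeblock($text, '{}');

print "block:     [$block]\n";
print "remainder: [$remainder]\n";
```
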
743           Examples:
744
745                   # Find a while loop in the text
746
747                           if ($text =~ s/.*?while\s*\{/{/)
748                           {
749                                   $loop = "while " . extract_codeblock($text);
750                           }
751
752                   # Remove the first round-bracketed list (which may include
753                   # round- or curly-bracketed code blocks or quotelike operators)
754
755                           extract_codeblock $text, "(){}", '[^(]*';
756
757           The ability to specify a different outermost delimiter bracket is
758           useful in some circumstances. For example, in the Parse::RecDescent
759           module, parser actions which are to be performed only on a
760           successful parse are specified using a "<defer:...>" directive. For
761           example:
762
763                   sentence: subject verb object
764                                   <defer: {$::theVerb = $item{verb}} >
765
766           Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to
767           extract the code within the "<defer:...>" directive, but there's a
768           problem.
769
770           A deferred action like this:
771
772                                   <defer: {if ($count>10) {$count--}} >
773
774           will be incorrectly parsed as:
775
776                                   <defer: {if ($count>
777
778           because the "greater than" operator is interpreted as a closing
779           delimiter.
780
781           But, by extracting the directive using
782           "extract_codeblock($text, '{}', undef, '<>')" the '>' character is
783           only treated as a delimiter at the outermost level of the code
784           block, so the directive is parsed correctly.
785
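              The four-argument form can be tried out with a short script.
              This is only a sketch: the input is the deferred action shown
              above, and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Single quotes keep $count from being interpolated by this script.
my $directive = '<defer: {if ($count>10) {$count--}} >';

# Inner delimiters are '{}' only; '<>' is recognized solely as the
# outermost pair, so the '>' in "$count>10" is ordinary code.
my ($extracted, $remainder) =
    extract_codeblock($directive, '{}', undef, '<>');

print defined $extracted ? "extracted: [$extracted]\n"
                         : "failed: $@\n";
```
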
786       "extract_multiple"
787           The "extract_multiple" subroutine takes a string to be processed
788           and a list of extractors (subroutines or regular expressions) to
789           apply to that string.
790
791           In a list context "extract_multiple" returns a list of
792           substrings of the original string, as extracted by the specified
793           extractors.  In a scalar context, "extract_multiple" returns the
794           first substring successfully extracted from the original string. In
795           both scalar and void contexts the original string has the first
796           successfully extracted substring removed from it. In all contexts
797           "extract_multiple" starts at the current "pos" of the string, and
798           sets that "pos" appropriately after it matches.
799
800           Hence, the aim of a call to "extract_multiple" in a list context is
801           to split the processed string into as many non-overlapping fields
802           as possible, by repeatedly applying each of the specified
803           extractors to the remainder of the string. Thus "extract_multiple"
804           is a generalized form of Perl's built-in "split" function.
805
806           The subroutine takes up to four optional arguments:
807
808           1.  A string to be processed ($_ if the string is omitted or
809               "undef")
810
811           2.  A reference to a list of subroutine references and/or qr//
812               objects and/or literal strings and/or hash references,
813               specifying the extractors to be used to split the string. If
814               this argument is omitted (or "undef") the list:
815
816                       [
817                               sub { extract_variable($_[0], '') },
818                               sub { extract_quotelike($_[0],'') },
819                               sub { extract_codeblock($_[0],'{}','') },
820                       ]
821
822               is used.
823
824           3.  A number specifying the maximum number of fields to return. If
825               this argument is omitted (or "undef"), split continues as long
826               as possible.
827
828               If the third argument is N, then extraction continues until N
829               fields have been successfully extracted, or until the string
830               has been completely processed.
831
832               Note that in scalar and void contexts the value of this
833               argument is automatically reset to 1 (under "-w", a warning is
834               issued if the argument has to be reset).
835
836           4.  A value indicating whether unmatched substrings (see below)
837               within the text should be skipped or returned as fields. If the
838               value is true, such substrings are skipped. Otherwise, they are
839               returned.
840
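              A short sketch of how the third and fourth arguments interact,
              assuming a hypothetical input string and a single
              "extract_variable" extractor. Because "extract_multiple" starts
              at the current "pos" of the string, the sketch resets "pos"
              before the second call:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_variable);

# Hypothetical input; single quotes keep the variables literal.
my $text = 'total = $price + $tax{vat} - $discount';
my $vars = [ sub { extract_variable($_[0], '') } ];

# Fourth argument true: text that is not a variable is skipped.
my @all = extract_multiple($text, $vars, undef, 1);

# Start again from the beginning of the same string.
pos($text) = 0;

# Third argument 2: stop once two fields have been extracted.
my @two = extract_multiple($text, $vars, 2, 1);

print "all: @all\n";
print "two: @two\n";
```
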
841           The extraction process works by applying each extractor in sequence
842           to the text string.
843
844           If the extractor is a subroutine it is called in a list context and
845           is expected to return a list of a single element, namely the
846           extracted text. It may optionally also return two further
847           values: a string representing the text left after extraction
848           (like $' for a pattern match), and a string representing any prefix
849           skipped before the extraction (like $` in a pattern match). Note
850           that this is designed to facilitate the use of other Text::Balanced
851           subroutines with "extract_multiple". Note too that the value
852           returned by an extractor subroutine need not bear any relationship
853           to the corresponding substring of the original text (see examples
854           below).
855
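              As an illustration of an extractor whose return value differs
              from the matched substring, here is a minimal sketch. The
              $doubled extractor is hypothetical; it relies on $_[0] being an
              alias for the text being processed, so that a "\G" match with
              the gc modifiers advances that text's "pos":

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

# Hypothetical extractor: matches a run of digits at the current
# position and returns twice its numeric value.
my $doubled = sub {
    return unless $_[0] =~ /\G(\d+)/gc;   # empty list signals failure
    return $1 * 2;
};

my $text   = '3 apples, 10 pears';

# Fourth argument true: everything that is not a number is skipped.
my @fields = extract_multiple($text, [$doubled], undef, 1);

print "@fields\n";
```
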
856           If the extractor is a precompiled regular expression or a string,
857           it is matched against the text in a scalar context with a leading
858           '\G' and the gc modifiers enabled. The extracted value is either $1
859           if that variable is defined after the match, or else the complete
860           match (i.e. $&).
861
862           If the extractor is a hash reference, it must contain exactly one
863           element.  The value of that element is one of the above extractor
864           types (subroutine reference, regular expression, or string).  The
865           key of that element is the name of a class into which the
866           successful return value of the extractor will be blessed.
867
868           If an extractor returns a defined value, that value is immediately
869           treated as the next extracted field and pushed onto the list of
870           fields.  If the extractor was specified in a hash reference, the
871           field is also blessed into the appropriate class.
872
873           If the extractor fails to match (in the case of a regex extractor),
874           or returns an empty list or an undefined value (in the case of a
875           subroutine extractor), it is assumed to have failed to extract.  If
876           none of the extractor subroutines succeeds, then one character is
877           extracted from the start of the text and the extraction subroutines
878           reapplied. Characters which are thus removed are accumulated and
879           eventually become the next field (unless the fourth argument is
880           true, in which case they are discarded).
881
882           For example, the following extracts substrings that are valid Perl
883           variables:
884
885                   @fields = extract_multiple($text,
886                                              [ sub { extract_variable($_[0]) } ],
887                                              undef, 1);
888
889           This example separates a text into fields which are quote
890           delimited, curly bracketed, and anything else. The delimited and
891           bracketed parts are also blessed to identify them (the "anything
892           else" is unblessed):
893
894                   @fields = extract_multiple($text,
895                              [
896                                   { Delim => sub { extract_delimited($_[0],q{'"}) } },
897                                   { Brack => sub { extract_bracketed($_[0],'{}') } },
898                              ]);
899
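              The blessed fields can then be told apart with "ref". The
              sketch below assumes that each blessed field is a reference to
              a scalar holding the extracted text (so it is read with $$field);
              the sample text is hypothetical:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text   = q{a 'b c' {d} e};
my @fields = extract_multiple($text, [
    { Delim => sub { extract_delimited($_[0], q{'"}) } },
    { Brack => sub { extract_bracketed($_[0], '{}') } },
]);

for my $field (@fields) {
    if (ref $field) {
        # Assumption: a blessed field is a reference to a scalar.
        printf "%-5s [%s]\n", ref $field, $$field;
    }
    else {
        printf "%-5s [%s]\n", 'plain', $field;
    }
}
```
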
900           This call extracts the next single substring that is a valid Perl
901           quotelike operator (and removes it from $text):
902
903                   $quotelike = extract_multiple($text,
904                                                 [
905                                                   sub { extract_quotelike($_[0]) },
906                                                 ], undef, 1);
907
908           Finally, here is yet another way to do comma-separated value
909           parsing:
910
911                   @fields = extract_multiple($csv_text,
912                                             [
913                                                   sub { extract_delimited($_[0],q{'"}) },
914                                                   qr/([^,]+)/,
915                                             ],
916                                             undef,1);
917
918           The list in the second argument means: "Try and extract a ' or "
919           delimited string, otherwise extract anything up to a comma...".
920           The undef third argument means: "...as many times as possible...",
921           and the true value in the fourth argument means "...discarding
922           anything else that appears (i.e. the commas)".
923
924           If you wanted the commas preserved as separate fields (i.e. like
925           split does if your split pattern has capturing parentheses), you
926           would just make the last parameter undefined (or remove it).
927
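              A self-contained sketch of the comma-separated case, using a
              hypothetical input line. It assumes a pattern with a single
              capture group, so that only the field itself is consumed by
              each match. Quoted fields keep their quotes:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

# Hypothetical CSV line with a comma inside a quoted field.
my $csv_text = q{alpha,"b,c",'d e',42};

my @fields = extract_multiple($csv_text,
    [
        sub { extract_delimited($_[0], q{'"}) },   # quoted field
        qr/([^,]+)/,                               # unquoted field
    ],
    undef, 1);                                     # discard the commas

print "[$_]\n" for @fields;
```
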
928       "gen_delimited_pat"
929           The "gen_delimited_pat" subroutine takes a single (string) argument
930           and builds a Friedl-style optimized regex that matches a string
931           delimited by any one of the characters in the single argument. For
932           example:
934
935                   gen_delimited_pat(q{'"})
936
937           returns the regex:
938
939                   (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
940
941           Note that the specified delimiters are automatically quotemeta'd.
942
943           A typical use of "gen_delimited_pat" would be to build special
944           purpose tags for "extract_tagged". For example, to properly ignore
945           "empty" XML elements (which might contain quoted strings):
946
947                   my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';
948
949                   extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );
950
951           "gen_delimited_pat" may also be called with an optional second
952           argument, which specifies the "escape" character(s) to be used for
953           each delimiter.  For example to match a Pascal-style string (where
954           ' is the delimiter and '' is a literal ' within the string):
955
956                   gen_delimited_pat(q{'},q{'});
957
958           Different escape characters can be specified for different
959           delimiters.  For example, to specify that '/' is the escape for
960           single quotes and '%' is the escape for double quotes:
961
962                   gen_delimited_pat(q{'"},q{/%});
963
964           If more delimiters than escape chars are specified, the last escape
965           char is used for the remaining delimiters.  If no escape char is
966           specified for a given specified delimiter, '\' is used.
967
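              The generated pattern is an ordinary regex string, so it can
              also be interpolated directly into a match. A minimal sketch,
              with hypothetical sample strings, covering the default
              backslash escape and the Pascal-style doubled delimiter:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Default escape character (backslash) for both quote styles.
my $quoted = gen_delimited_pat(q{'"});

# Hypothetical sample text containing escaped double quotes.
my $text    = q{say "hi \"there\"" and 'bye'};
my @strings = $text =~ /($quoted)/g;
print "$_\n" for @strings;

# Pascal-style: the delimiter doubles as its own escape character.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($str)  = q{x := 'it''s';} =~ /($pascal)/;
print "$str\n";
```
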
968       "delimited_pat"
969           Note that "gen_delimited_pat" was previously called
970           "delimited_pat".  That name may still be used, but is now
971           deprecated.
972

DIAGNOSTICS

974       In a list context, all the functions return "(undef,$original_text)" on
975       failure. In a scalar context, failure is indicated by returning "undef"
976       (in this case the input text is not modified in any way).
977
978       In addition, on failure in any context, the $@ variable is set.
979       Accessing "$@->{error}" returns one of the error diagnostics listed
980       below.  Accessing "$@->{pos}" returns the offset into the original
981       string at which the error was detected (although not necessarily where
982       it occurred!).  Printing $@ directly produces the error message, with
983       the offset appended.  On success, the $@ variable is guaranteed to be
984       "undef".
985
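          A minimal sketch of checking for failure in a list context and
          inspecting $@, assuming a hypothetical input that contains no
          brackets. The error object is copied straight away, since $@ is a
          global that later code may overwrite:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = 'no brackets here';

my ($extracted, $remainder) = extract_bracketed($text, '()');
my $err = $@;    # copy the error object before anything else runs

if (!defined $extracted) {
    printf "error:  %s\n", $err->{error};
    printf "offset: %d\n", $err->{pos};
    print  "text is unchanged\n" if $remainder eq $text;
}
```
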
986       The available diagnostics are:
987
988       "Did not find a suitable bracket: "%s""
989           The delimiter provided to "extract_bracketed" was not one of
990           '()[]<>{}'.
991
992       "Did not find prefix: /%s/"
993           A non-optional prefix was specified but wasn't found at the start
994           of the text.
995
996       "Did not find opening bracket after prefix: "%s""
997           "extract_bracketed" or "extract_codeblock" was expecting a
998           particular kind of bracket at the start of the text, and didn't
999           find it.
1000
1001       "No quotelike operator found after prefix: "%s""
1002           "extract_quotelike" didn't find one of the quotelike operators "q",
1003           "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it
1004           was extracting.
1005
1006       "Unmatched closing bracket: "%c""
1007           "extract_bracketed", "extract_quotelike" or "extract_codeblock"
1008           encountered a closing bracket where none was expected.
1009
1010       "Unmatched opening bracket(s): "%s""
1011           "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran
1012           out of characters in the text before closing one or more levels of
1013           nested brackets.
1014
1015       "Unmatched embedded quote (%s)"
1016           "extract_bracketed" attempted to match an embedded quoted
1017           substring, but failed to find a closing quote to match it.
1018
1019       "Did not find closing delimiter to match '%s'"
1020           "extract_quotelike" was unable to find a closing delimiter to match
1021           the one that opened the quote-like operation.
1022
1023       "Mismatched closing bracket: expected "%c" but found "%s""
1024           "extract_bracketed", "extract_quotelike" or "extract_codeblock"
1025           found a valid bracket delimiter, but it was the wrong species. This
1026           usually indicates a nesting error, but may indicate incorrect
1027           quoting or escaping.
1028
1029       "No block delimiter found after quotelike "%s""
1030           "extract_quotelike" or "extract_codeblock" found one of the
1031           quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without
1032           a suitable block after it.
1033
1034       "Did not find leading dereferencer"
1035           "extract_variable" was expecting one of '$', '@', or '%' at the
1036           start of a variable, but didn't find any of them.
1037
1038       "Bad identifier after dereferencer"
1039           "extract_variable" found a '$', '@', or '%' indicating a variable,
1040           but that character was not followed by a legal Perl identifier.
1041
1042       "Did not find expected opening bracket at %s"
1043           "extract_codeblock" failed to find any of the outermost opening
1044           brackets that were specified.
1045
1046       "Improperly nested codeblock at %s"
1047           A nested code block was found that started with a delimiter that
1048           was specified as being only to be used as an outermost bracket.
1049
1050       "Missing second block for quotelike "%s""
1051           "extract_codeblock" or "extract_quotelike" found one of the
1052           quotelike operators "s", "tr" or "y" followed by only one block.
1053
1054       "No match found for opening bracket"
1055           "extract_codeblock" failed to find a closing bracket to match the
1056           outermost opening bracket.
1057
1058       "Did not find opening tag: /%s/"
1059           "extract_tagged" did not find a suitable opening tag (after any
1060           specified prefix was removed).
1061
1062       "Unable to construct closing tag to match: /%s/"
1063           "extract_tagged" matched the specified opening tag and tried to
1064           modify the matched text to produce a matching closing tag (because
1065           none was specified). It failed to generate the closing tag, almost
1066           certainly because the opening tag did not start with a bracket of
1067           some kind.
1068
1069       "Found invalid nested tag: %s"
1070           "extract_tagged" found a nested tag that appeared in the "reject"
1071           list (and the failure mode was not "MAX" or "PARA").
1072
1073       "Found unbalanced nested tag: %s"
1074           "extract_tagged" found a nested opening tag that was not matched by
1075           a corresponding nested closing tag (and the failure mode was not
1076           "MAX" or "PARA").
1077
1078       "Did not find closing tag"
1079           "extract_tagged" reached the end of the text without finding a
1080           closing tag to match the original opening tag (and the failure mode
1081           was not "MAX" or "PARA").
1082

EXPORTS

1084       The following symbols are, or can be, exported by this module:
1085
1086       Default Exports
1087           None.
1088
1089       Optional Exports
1090           "extract_delimited", "extract_bracketed", "extract_quotelike",
1091           "extract_codeblock", "extract_variable", "extract_tagged",
1092           "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",
1093           "delimited_pat".
1094
1095       Export Tags
1096           ":ALL"
1097               "extract_delimited", "extract_bracketed", "extract_quotelike",
1098               "extract_codeblock", "extract_variable", "extract_tagged",
1099               "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",
1100               "delimited_pat".
1101

KNOWN BUGS

1103       See
1104       <https://rt.cpan.org/Dist/Display.html?Status=Active&Queue=Text-Balanced>.
1105

FEEDBACK

1107       Patches, bug reports, suggestions or any other feedback is welcome.
1108
1109       Patches can be sent as GitHub pull requests at
1110       <https://github.com/steve-m-hay/Text-Balanced/pulls>.
1111
1112       Bug reports and suggestions can be made on the CPAN Request Tracker at
1113       <https://rt.cpan.org/Public/Bug/Report.html?Queue=Text-Balanced>.
1114
1115       Currently active requests on the CPAN Request Tracker can be viewed at
1116       <https://rt.cpan.org/Public/Dist/Display.html?Status=Active;Queue=Text-Balanced>.
1117
1118       Please test this distribution.  See CPAN Testers Reports at
1119       <https://www.cpantesters.org/> for details of how to get involved.
1120
1121       Previous test results on CPAN Testers Reports can be viewed at
1122       <https://www.cpantesters.org/distro/T/Text-Balanced.html>.
1123
1124       Please rate this distribution on CPAN Ratings at
1125       <https://cpanratings.perl.org/rate/?distribution=Text-Balanced>.
1126

AVAILABILITY

1128       The latest version of this module is available from CPAN (see "CPAN" in
1129       perlmodlib for details) at
1130
1131       <https://metacpan.org/release/Text-Balanced> or
1132
1133       <https://www.cpan.org/authors/id/S/SH/SHAY/> or
1134
1135       <https://www.cpan.org/modules/by-module/Text/>.
1136
1137       The latest source code is available from GitHub at
1138       <https://github.com/steve-m-hay/Text-Balanced>.
1139

INSTALLATION

1141       See the INSTALL file.
1142

AUTHOR

1144       Damian Conway <damian@conway.org>.
1145
1146       Steve Hay <shay@cpan.org> is now maintaining
1147       Text::Balanced as of version 2.03.
1148

COPYRIGHT

1150       Copyright (C) 1997-2001 Damian Conway.  All rights reserved.
1151
1152       Copyright (C) 2009 Adam Kennedy.
1153
1154       Copyright (C) 2015, 2020 Steve Hay.  All rights reserved.
1155

LICENCE

1157       This module is free software; you can redistribute it and/or modify it
1158       under the same terms as Perl itself, i.e. under the terms of either the
1159       GNU General Public License or the Artistic License, as specified in the
1160       LICENCE file.
1161

VERSION

1163       Version 2.04
1164

DATE

1166       11 Dec 2020
1167

HISTORY

1169       See the Changes file.
1170
1171
1172
1173perl v5.32.1                      2020-12-14                 Text::Balanced(3)