Text::Balanced(3)     User Contributed Perl Documentation    Text::Balanced(3)

NAME

       Text::Balanced - Extract delimited text sequences from strings.

SYNOPSIS

           use Text::Balanced qw (
               extract_delimited
               extract_bracketed
               extract_quotelike
               extract_codeblock
               extract_variable
               extract_tagged
               extract_multiple
               gen_delimited_pat
               gen_extract_tagged
           );

           # Extract the initial substring of $text that is delimited by
           # two (unescaped) instances of the first character in $delim.

           ($extracted, $remainder) = extract_delimited($text,$delim);

           # Extract the initial substring of $text that is bracketed
           # with a delimiter(s) specified by $delim (where the string
           # in $delim contains one or more of '(){}[]<>').

           ($extracted, $remainder) = extract_bracketed($text,$delim);

           # Extract the initial substring of $text that is bounded by
           # an XML tag.

           ($extracted, $remainder) = extract_tagged($text);

           # Extract the initial substring of $text that is bounded by
           # a BEGIN...END pair. Don't allow nested BEGIN tags.

           ($extracted, $remainder) =
               extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});

           # Extract the initial substring of $text that represents a
           # Perl "quote or quote-like operation"

           ($extracted, $remainder) = extract_quotelike($text);

           # Extract the initial substring of $text that represents a block
           # of Perl code, bracketed by any of character(s) specified by $delim
           # (where the string $delim contains one or more of '(){}[]<>').

           ($extracted, $remainder) = extract_codeblock($text,$delim);

           # Extract the initial substrings of $text that would be extracted by
           # one or more sequential applications of the specified functions
           # or regular expressions

           @extracted = extract_multiple($text,
                                         [ \&extract_bracketed,
                                           \&extract_quotelike,
                                           \&some_other_extractor_sub,
                                           qr/[xyz]*/,
                                           'literal',
                                         ]);

           # Create a string representing an optimized pattern (a la Friedl)
           # that matches a substring delimited by any of the specified characters
           # (in this case: any type of quote or a slash)

           $patstring = gen_delimited_pat(q{'"`/});

           # Generate a reference to an anonymous sub that is just like extract_tagged
           # but pre-compiled and optimized for a specific pair of tags, and
           # consequently much faster (i.e. 3 times faster). It uses qr// for better
           # performance on repeated calls.

           $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
           ($extracted, $remainder) = $extract_head->($text);

DESCRIPTION

       The various "extract_..." subroutines may be used to extract a
       delimited substring, possibly after skipping a specified prefix string.
       By default, that prefix is optional whitespace ("/\s*/"), but you can
       change it to whatever you wish (see below).

       The substring to be extracted must appear at the current "pos" location
       of the string's variable (or at index zero, if no "pos" position is
       defined).  In other words, the "extract_..." subroutines don't extract
       the first occurrence of a substring anywhere in a string (like an
       unanchored regex would). Rather, they extract an occurrence of the
       substring appearing immediately at the current matching position in the
       string (like a "\G"-anchored regex would).

   General Behaviour in List Contexts
       In a list context, all the subroutines return a list, the first three
       elements of which are always:

       [0] The extracted string, including the specified delimiters.  If the
           extraction fails "undef" is returned.

       [1] The remainder of the input string (i.e. the characters after the
           extracted string). On failure, the entire string is returned.

       [2] The skipped prefix (i.e. the characters before the extracted
           string).  On failure, "undef" is returned.

       Note that in a list context, the contents of the original input text
       (the first argument) are not modified in any way.

       However, if the input text was passed in a variable, that variable's
       "pos" value is updated to point at the first character after the
       extracted text. That means that in a list context the various
       subroutines can be used much like regular expressions. For example:

           while ( $next = (extract_quotelike($text))[0] )
           {
               # process next quote-like (in $next)
           }

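       The list-context behaviour described above can be sketched as follows.
       This is a minimal illustration using "extract_delimited"; the input
       string and variable names are assumptions made up for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical input: two leading spaces, a quoted word, then more text.
my $text = q{  'first' rest};

# List context: the input is left intact and three elements come back.
my ($extracted, $remainder, $prefix) = extract_delimited($text, q{'});

print "extracted: [$extracted]\n";      # the quoted word, delimiters included
print "remainder: [$remainder]\n";      # everything after the closing quote
print "prefix:    [$prefix]\n";         # the skipped leading whitespace
print "text:      [$text]\n";           # unchanged in list context
print "pos:       ", pos($text), "\n";  # offset just past the extracted text
```
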
   General Behaviour in Scalar and Void Contexts
       In a scalar context, the extracted string is returned, having first
       been removed from the input text. Thus, the following code also
       processes each quote-like operation, but actually removes them from
       $text:

           while ( $next = extract_quotelike($text) )
           {
               # process next quote-like (in $next)
           }

       Note that if the input text is a read-only string (i.e. a literal), no
       attempt is made to remove the extracted text.

       In a void context the behaviour of the extraction subroutines is
       exactly the same as in a scalar context, except (of course) that the
       extracted substring is not returned.

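       A small sketch of the scalar-context behaviour, again using
       "extract_delimited" on an invented input string. Because each
       successful call removes the match from the variable, the loop ends
       once nothing extractable remains at the front of the text.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical input holding two quoted items followed by plain text.
my $text = q{'a' 'b' tail};
my @found;

# Scalar context: each successful call removes the match (and the
# skipped whitespace prefix) from $text, so the loop makes progress.
while (defined(my $next = extract_delimited($text, q{'}))) {
    push @found, $next;
}

print "found: @found\n";
print "left:  [$text]\n";
```
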
   A Note About Prefixes
       Prefix patterns are matched without any trailing modifiers ("/gimsox"
       etc.)  This can bite you if you're expecting a prefix specification
       like '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a
       prefix pattern will only succeed if the <H1> tag is on the current
       line, since . normally doesn't match newlines.

       To overcome this limitation, you need to turn on /s matching within the
       prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)'

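       The effect of "(?s)" in a prefix can be seen in the following sketch.
       It uses "extract_delimited" with a deliberately simple prefix; the
       two-line input string is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Hypothetical input: the quoted string sits on the second line.
# Two separate copies are used so that one call cannot affect the other.
my $first  = "line one\nline two 'quoted' tail";
my $second = $first;

# Without (?s), '.' cannot cross the newline, so the prefix cannot
# reach the quote and the extraction fails.
my ($plain) = extract_delimited($first, q{'}, '.*?');

# With (?s), the prefix may span lines and the extraction succeeds.
my ($dotall, $rest, $skipped) = extract_delimited($second, q{'}, '(?s).*?');

print defined $plain ? "plain:  $plain\n" : "plain:  (failed)\n";
print "dotall: $dotall\n";
```
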
   Functions
       "extract_delimited"
           The "extract_delimited" function formalizes the common idiom of
           extracting a single-character-delimited substring from the start of
           a string. For example, to extract a single-quote delimited string,
           the following code is typically used:

               ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
               $extracted = $1;

           but with "extract_delimited" it can be simplified to:

               ($extracted,$remainder) = extract_delimited($text, "'");

           "extract_delimited" takes up to four scalars (the input text, the
           delimiters, a prefix pattern to be skipped, and any escape
           characters) and extracts the initial substring of the text that is
           appropriately delimited. If the delimiter string has multiple
           characters, the first one encountered in the text is taken to
           delimit the substring.  The third argument specifies a prefix
           pattern that is to be skipped (but must be present!) before the
           substring is extracted.  The final argument specifies the escape
           character to be used for each delimiter.

           All arguments are optional. If the escape characters are not
           specified, every delimiter is escaped with a backslash ("\").  If
           the prefix is not specified, the pattern '\s*' - optional
           whitespace - is used. If the delimiter set is also not specified,
           the set "/["'`]/" is used. If the text to be processed is not
           specified either, $_ is used.

           In list context, "extract_delimited" returns an array of three
           elements: the extracted substring (including the surrounding
           delimiters), the remainder of the text, and the skipped prefix (if
           any). If a suitable delimited substring is not found, the first
           element of the array is "undef", the second is the complete
           original text, and the third (the prefix) is also "undef".

           In a scalar context, just the extracted substring is returned. In a
           void context, the extracted substring (and any prefix) are simply
           removed from the beginning of the first argument.

           Examples:

               # Remove a single-quoted substring from the very beginning of $text:

                   $substring = extract_delimited($text, "'", '');

               # Remove a single-quoted Pascalish substring (i.e. one in which
               # doubling the quote character escapes it) from the very
               # beginning of $text:

                   $substring = extract_delimited($text, "'", '', "'");

               # Extract a single- or double- quoted substring from the
               # beginning of $text, optionally after some whitespace
               # (note the list context to protect $text from modification):

                   ($substring) = extract_delimited $text, q{"'};

               # Delete the substring delimited by the first '/' in $text:

                   $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];

           Note that this last example is not the same as deleting the first
           quote-like pattern. For instance, if $text contained the string:

               "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"

           then after the deletion it would contain:

               "if ('.$UNIXCMD/s) { $cmd = $1; }"

           not:

               "if ('./cmd' =~ ms) { $cmd = $1; }"

           See "extract_quotelike" for a (partial) solution to this problem.

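           The examples above can be combined into a short runnable sketch.
           The input strings and variable names ($pascal, $code, $plain) are
           assumptions chosen for illustration; the calls themselves follow
           the argument order described in this section.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

# Pascal-style string: a doubled quote stands for a literal quote.
# Scalar context, so the match is removed from $pascal.
my $pascal = q{'it''s' rest};
my $string = extract_delimited($pascal, q{'}, '', q{'});
print "string:  $string\n";

# Delete the first '/'-delimited substring by joining the skipped
# prefix (element 2) with the remainder (element 1).
my $code    = q{if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }};
my $deleted = join '', (extract_delimited($code, '/', '[^/]*'))[2,1];
print "deleted: $deleted\n";

# A failed extraction in list context hands back the whole text
# as the remainder and an undefined first element.
my $plain = 'no quotes here';
my ($none, $all) = extract_delimited($plain, q{'});
print "failed, remainder: $all\n" unless defined $none;
```
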
       "extract_bracketed"
           Like "extract_delimited", the "extract_bracketed" function takes up
           to three optional scalar arguments: a string to extract from, a
           delimiter specifier, and a prefix pattern. As before, a missing
           prefix defaults to optional whitespace and a missing text defaults
           to $_. However, a missing delimiter specifier defaults to
           '{}()[]<>' (see below).

           "extract_bracketed" extracts a balanced-bracket-delimited substring
           (using any one (or more) of the user-specified delimiter brackets:
           '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect
           quoted unbalanced brackets (see below).

           A "delimiter bracket" is a bracket in the list of delimiters passed
           as "extract_bracketed"'s second argument. Delimiter brackets are
           specified by giving either the left or right (or both!) versions of
           the required bracket(s). Note that the order in which two or more
           delimiter brackets are specified is not significant.

           A "balanced-bracket-delimited substring" is a substring bounded by
           matched brackets, such that any other (left or right) delimiter
           bracket within the substring is also matched by an opposite (right
           or left) delimiter bracket at the same level of nesting. Any type
           of bracket not in the delimiter list is treated as an ordinary
           character.

           In other words, each type of bracket specified as a delimiter must
           be balanced and correctly nested within the substring, and any
           other kind of ("non-delimiter") bracket in the substring is
           ignored.

           For example, given the string:

               $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";

           then a call to "extract_bracketed" in a list context:

               @result = extract_bracketed( $text, '{}' );

           would return:

               ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )

           since both sets of '{..}' brackets are properly nested and evenly
           balanced.  (In a scalar context just the first element of the array
           would be returned. In a void context, $text would be replaced by an
           empty string.)

           Likewise the call in:

               @result = extract_bracketed( $text, '{[' );

           would return the same result, since all sets of both types of
           specified delimiter brackets are correctly nested and balanced.

           However, the call in:

               @result = extract_bracketed( $text, '{([<' );

           would fail, returning:

               ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" )

           because the embedded pairs of '(..)'s and '[..]'s are "cross-
           nested" and the embedded '>' is unbalanced. (In a scalar context,
           this call would return an empty string. In a void context, $text
           would be unchanged.)

           Note that the embedded single-quotes in the string don't help in
           this case, since they have not been specified as acceptable
           delimiters and are therefore treated as non-delimiter characters
           (and ignored).

           However, if a particular species of quote character is included in
           the delimiter specification, then that type of quote will be
           correctly handled.  For example, if $text is:

               $text = '<A HREF=">>>>">link</A>';

           then

               @result = extract_bracketed( $text, '<">' );

           returns:

               ( '<A HREF=">>>>">', 'link</A>', "" )

           as expected. Without the specification of """ as an embedded
           quoter:

               @result = extract_bracketed( $text, '<>' );

           the result would be:

               ( '<A HREF=">', '>>>">link</A>', "" )

           In addition to the quote delimiters "'", """, and "`", full Perl
           quote-like quoting (i.e. q{string}, qq{string}, etc) can be
           specified by including the letter 'q' as a delimiter. Hence:

               @result = extract_bracketed( $text, '<q>' );

           would correctly match something like this:

               $text = '<leftop: conj /and/ conj>';

           See also: "extract_quotelike" and "extract_codeblock".

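           The following sketch replays three of the calls discussed above
           so the results can be inspected directly. Separate copies of the
           input are used (an assumption of this example, not a requirement
           stated above) because a successful list-context call advances the
           variable's "pos", which would affect a second call on the same
           variable.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text  = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
my $again = $text;    # independent copy with its own match position

# Only braces are delimiters, so the other brackets are plain characters.
my @braces = extract_bracketed($text, '{}');

# All four bracket types are delimiters, so the "(...]" pairing is an error.
my @all = extract_bracketed($again, '{([<');

# Listing '"' as a delimiter makes brackets inside double quotes harmless.
my $html = '<A HREF=">>>>">link</A>';
my @tag  = extract_bracketed($html, '<">');

print "braces matched the whole string\n" if $braces[0] eq $text;
print "mixed delimiters failed\n" unless defined $all[0];
print "tag: $tag[0] | rest: $tag[1]\n";
```
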
       "extract_variable"
           "extract_variable" extracts any valid Perl variable or variable-
           involved expression, including scalars, arrays, hashes, array
           accesses, hash look-ups, method calls through objects, subroutine
           calls through subroutine references, etc.

           The subroutine takes up to two optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

           2.  A string specifying a pattern to be matched as a prefix (which
               is to be skipped). If omitted, optional whitespace is skipped.

           On success in a list context, an array of 3 elements is returned.
           The elements are:

           [0] the extracted variable, or variablish expression

           [1] the remainder of the input text,

           [2] the prefix substring (if any),

           On failure, all of these values (except the remaining text) are
           "undef".

           In a scalar context, "extract_variable" returns just the complete
           substring that matched a variablish expression. "undef" is returned
           on failure. In addition, the original input text has the returned
           substring (and any prefix) removed from it.

           In a void context, the input text just has the matched substring
           (and any specified prefix) removed.

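           A minimal sketch of a list-context call, assuming a line of Perl
           source held in a string. The variable name $source and its
           contents are invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

# Hypothetical line of Perl source; q{} keeps the '$' from interpolating.
my $source = q{  $config{name} = 'demo';};

# List context: the variable, the remainder, and the skipped prefix.
my ($variable, $remainder, $prefix) = extract_variable($source);

print "variable:  [$variable]\n";
print "remainder: [$remainder]\n";
print "prefix:    [$prefix]\n";
```
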
       "extract_tagged"
           "extract_tagged" extracts and segments text between (balanced)
           specified tags.

           The subroutine takes up to five optional arguments:

           1.  A string to be processed ($_ if the string is omitted or
               "undef")

           2.  A string specifying a pattern (i.e. regex) to be matched as the
               opening tag.  If the pattern string is omitted (or "undef")
               then a pattern that matches any standard XML tag is used.

           3.  A string specifying a pattern to be matched at the closing tag.
               If the pattern string is omitted (or "undef") then the closing
               tag is constructed by inserting a "/" after any leading bracket
               characters in the actual opening tag that was matched (not the
               pattern that matched the tag). For example, if the opening tag
               pattern is specified as '{{\w+}}' and actually matched the
               opening tag "{{DATA}}", then the constructed closing tag would
               be "{{/DATA}}".

           4.  A string specifying a pattern to be matched as a prefix (which
               is to be skipped). If omitted, optional whitespace is skipped.

           5.  A hash reference containing various parsing options (see below)

           The various options that can be specified are:

           "reject => $listref"
               The list reference contains one or more strings specifying
               patterns that must not appear within the tagged text.

               For example, to extract an HTML link (which should not contain
               nested links) use:

                       extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );

           "ignore => $listref"
               The list reference contains one or more strings specifying
               patterns that are not to be treated as nested tags within the
               tagged text (even if they would match the start tag pattern).

               For example, to extract an arbitrary XML tag, but ignore
               "empty" elements:

                       extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );

               (also see "gen_delimited_pat" below).

           "fail => $str"
               The "fail" option indicates the action to be taken if a
               matching end tag is not encountered (i.e. before the end of the
               string or some "reject" pattern matches). By default, a failure
               to match a closing tag causes "extract_tagged" to immediately
               fail.

               However, if the string value associated with "fail" is "MAX",
               then "extract_tagged" returns the complete text up to the point
               of failure.  If the string is "PARA", "extract_tagged" returns
               only the first paragraph after the tag (up to the first line
               that is either empty or contains only whitespace characters).
               If the string is "", the default behaviour (i.e. failure) is
               reinstated.

               For example, suppose the start tag "/para" introduces a
               paragraph, which then continues until the next "/endpara" tag
               or until another "/para" tag is encountered:

                       $text = "/para line 1\n\nline 3\n/para line 4";

                       extract_tagged($text, '/para', '/endpara', undef,
                                               {reject => ['/para'], fail => 'MAX'} );

                       # EXTRACTED: "/para line 1\n\nline 3\n"

               Suppose instead, that if no matching "/endpara" tag is found,
               the "/para" tag refers only to the immediately following
               paragraph:

                       $text = "/para line 1\n\nline 3\n/para line 4";

                       extract_tagged($text, '/para', '/endpara', undef,
                                       {reject => ['/para'], fail => 'PARA'} );

                       # EXTRACTED: "/para line 1\n"

               Note that the specified "fail" behaviour applies to nested tags
               as well.

           On success in a list context, an array of 6 elements is returned.
           The elements are:

           [0] the extracted tagged substring (including the outermost tags),

           [1] the remainder of the input text,

           [2] the prefix substring (if any),

           [3] the opening tag

           [4] the text between the opening and closing tags

           [5] the closing tag (or "" if no closing tag was found)

           On failure, all of these values (except the remaining text) are
           "undef".

           In a scalar context, "extract_tagged" returns just the complete
           substring that matched a tagged text (including the start and end
           tags). "undef" is returned on failure. In addition, the original
           input text has the returned substring (and any prefix) removed from
           it.

           In a void context, the input text just has the matched substring
           (and any specified prefix) removed.

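           The six-element return value and the "fail" option can be
           explored with a sketch like the one below. The markup strings and
           variable names are assumptions made up for the example; the
           second call reuses the "/para" scenario described above.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

# Default tag patterns: any XML-style opening tag and its matching close.
my $markup = '<b>bold <i>nested</i></b> tail';
my @parts  = extract_tagged($markup);

print "whole:   $parts[0]\n";   # outermost tags included
print "opening: $parts[3]\n";
print "inner:   $parts[4]\n";   # nested tags stay inside the text
print "closing: $parts[5]\n";

# "/para" markup with no "/endpara": with fail => 'MAX' the text up to
# the rejected nested "/para" is returned instead of failing outright.
my $paras = "/para line 1\n\nline 3\n/para line 4";
my ($para) = extract_tagged($paras, '/para', '/endpara', undef,
                            { reject => ['/para'], fail => 'MAX' });

print "para: [$para]\n";
```
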
       "gen_extract_tagged"
           "gen_extract_tagged" generates a new anonymous subroutine which
           extracts text between (balanced) specified tags. In other words, it
           generates a function that behaves identically to "extract_tagged".

           The difference between "extract_tagged" and the anonymous
           subroutines generated by "gen_extract_tagged", is that those
           generated subroutines:

           •   do not have to reparse tag specification or parsing options
               every time they are called (whereas "extract_tagged" has to
               effectively rebuild its tag parser on every call);

           •   make use of the qr// construct to pre-compile the regexes
               they use (whereas "extract_tagged" uses standard string
               variable interpolation to create tag-matching patterns).

           The subroutine takes up to four optional arguments (the same set as
           "extract_tagged" except for the string to be processed). It returns
           a reference to a subroutine which in turn takes a single argument
           (the text to be extracted from).

           In other words, the implementation of "extract_tagged" is exactly
           equivalent to:

                   sub extract_tagged
                   {
                           my $text = shift;
                           my $extractor = gen_extract_tagged(@_);
                           return $extractor->($text);
                   }

           (although "extract_tagged" is not currently implemented that way).

           Using "gen_extract_tagged" to create extraction functions for
           specific tags is a good idea if those functions are going to be
           called more than once, since their performance is typically twice
           as good as the more general-purpose "extract_tagged".

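           A brief sketch of the intended usage pattern: build the extractor
           once, then apply it to several inputs. The document strings and
           the names @docs and @heads are assumptions for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once; the tag pair here is only an example.
my $extract_head = gen_extract_tagged('<HEAD>', '</HEAD>');

# Reuse it on several (hypothetical) documents.
my @docs = (
    '<HEAD><TITLE>One</TITLE></HEAD><BODY>first</BODY>',
    '<HEAD><TITLE>Two</TITLE></HEAD><BODY>second</BODY>',
);

my @heads;
for my $doc (@docs) {
    # List context: $doc itself is not modified.
    my ($head, $rest) = $extract_head->($doc);
    push @heads, $head if defined $head;
}

print "$_\n" for @heads;
```
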
       "extract_quotelike"
           "extract_quotelike" attempts to recognize, extract, and segment any
           one of the various Perl quotes and quotelike operators (see
           perlop(3)). Nested backslashed delimiters, embedded balanced bracket
           delimiters (for the quotelike operators), and trailing modifiers
           are all caught. For example, in:

                   extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'

                   extract_quotelike '  "You said, \"Use sed\"."  '

                   extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '

                   extract_quotelike ' tr/\\\/\\\\/\\\//ds; '

           the full Perl quotelike operations are all extracted correctly.

           Note too that, when using the /x modifier on a regex, any comment
           containing the current pattern delimiter will cause the regex to be
           immediately terminated. In other words:

                   'm /
                           (?i)            # CASE INSENSITIVE
                           [a-z_]          # LEADING ALPHABETIC/UNDERSCORE
                           [a-z0-9]*       # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
                      /x'

           will be extracted as if it were:

                   'm /
                           (?i)            # CASE INSENSITIVE
                           [a-z_]          # LEADING ALPHABETIC/'

           This behaviour is identical to that of the actual compiler.

           "extract_quotelike" takes two arguments: the text to be processed
           and a prefix to be matched at the very beginning of the text. If no
           prefix is specified, optional whitespace is the default. If no text
           is given, $_ is used.

           In a list context, an array of 11 elements is returned. The
           elements are:

           [0] the extracted quotelike substring (including trailing
               modifiers),

           [1] the remainder of the input text,

           [2] the prefix substring (if any),

           [3] the name of the quotelike operator (if any),

           [4] the left delimiter of the first block of the operation,

           [5] the text of the first block of the operation (that is, the
               contents of a quote, the regex of a match or substitution or
               the target list of a translation),

           [6] the right delimiter of the first block of the operation,

           [7] the left delimiter of the second block of the operation (that
               is, if it is a "s", "tr", or "y"),

           [8] the text of the second block of the operation (that is, the
               replacement of a substitution or the translation list of a
               translation),

           [9] the right delimiter of the second block of the operation (if
               any),

           [10]
               the trailing modifiers on the operation (if any).

           For each of the fields marked "(if any)" the default value on
           success is an empty string.  On failure, all of these values
           (except the remaining text) are "undef".

           In a scalar context, "extract_quotelike" returns just the complete
           substring that matched a quotelike operation (or "undef" on
           failure). In a scalar or void context, the input text has the same
           substring (and any specified prefix) removed.

           Examples:

                   # Remove the first quotelike literal that appears in text

                           $quotelike = extract_quotelike($text,'.*?');

                   # Replace one or more leading whitespace-separated quotelike
                   # literals in $_ with "<QLL>"

                           do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;


                   # Isolate the search pattern in a quotelike operation from $text

                           ($op,$pat) = (extract_quotelike $text)[3,5];
                           if ($op =~ /[ms]/)
                           {
                                   print "search pattern: $pat\n";
                           }
                           else
                           {
                                   print "$op is not a pattern matching operation\n";
                           }

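           To see how the fields of the list-context return value line up
           with the indices listed above, a sketch such as the following may
           help. The source fragment in $source is an invented example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

# Hypothetical fragment of Perl source beginning with a match operator.
my $source = q{m/ab+c/i && print "hit"};

my @field = extract_quotelike($source);

print "whole:     $field[0]\n";    # the complete operation with modifiers
print "operator:  $field[3]\n";    # name of the quotelike operator
print "left:      $field[4]\n";    # left delimiter of the first block
print "pattern:   $field[5]\n";    # text of the first block
print "right:     $field[6]\n";    # right delimiter of the first block
print "modifiers: $field[10]\n";   # trailing modifiers
print "remainder: $field[1]\n";
```
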
       "extract_quotelike" and "here documents"
           "extract_quotelike" can successfully extract "here documents" from
           an input string, but with an important caveat in list contexts.

           Unlike other types of quote-like literals, a here document is
           rarely a contiguous substring. For example, a typical piece of code
           using a here document might look like this:

                   <<'EOMSG' || die;
                   This is the message.
                   EOMSG
                   exit;

           Given this as an input string in a scalar context,
           "extract_quotelike" would correctly return the string
           "<<'EOMSG'\nThis is the message.\nEOMSG", leaving the string " ||
           die;\nexit;" in the original variable. In other words, the two
           separate pieces of the here document are successfully extracted and
           concatenated.

           In a list context, "extract_quotelike" would return the list

           [0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full
               extracted here document, including fore and aft delimiters),

           [1] " || die;\nexit;" (i.e. the remainder of the input text,
               concatenated),

           [2] "" (i.e. the prefix substring -- trivial in this case),

           [3] "<<" (i.e. the "name" of the quotelike operator)

           [4] "'EOMSG'" (i.e. the left delimiter of the here document,
               including any quotes),

           [5] "This is the message.\n" (i.e. the text of the here document),

           [6] "EOMSG" (i.e. the right delimiter of the here document),

           [7..10]
               "" (a here document has no second left delimiter, second text,
               second right delimiter, or trailing modifiers).

           However, the matching position of the input variable would be set
           to "exit;" (i.e. after the closing delimiter of the here document),
           which would cause the earlier " || die;\nexit;" to be skipped in
           any sequence of code fragment extractions.

           To avoid this problem, when it encounters a here document whilst
           extracting from a modifiable string, "extract_quotelike" silently
           rearranges the string to an equivalent piece of Perl:

                   <<'EOMSG'
                   This is the message.
                   EOMSG
                   || die;
                   exit;

           in which the here document is contiguous. It still leaves the
           matching position after the here document, but now the rest of the
           line on which the here document starts is not skipped.

           To prevent "extract_quotelike" from mucking about with the input in
           this way (this is the only case where a list-context
           "extract_quotelike" does so), you can pass the input variable as an
           interpolated literal:

                   $quotelike = extract_quotelike("$var");

701       "extract_codeblock"
702           "extract_codeblock" attempts to recognize and extract a balanced
703           bracket delimited substring that may contain unbalanced brackets
704           inside Perl quotes or quotelike operations. That is,
705           "extract_codeblock" is like a combination of "extract_bracketed"
706           and "extract_quotelike".
707
708           "extract_codeblock" takes the same initial three parameters as
709           "extract_bracketed": a text to process, a set of delimiter brackets
710           to look for, and a prefix to match first. It also takes an optional
711           fourth parameter, which allows the outermost delimiter brackets to
712           be specified separately (see below), and a fifth parameter used
713           only by Parse::RecDescent.
714
715           Omitting the first argument (input text) means process $_ instead.
716           Omitting the second argument (delimiter brackets) indicates that
717           only '{' is to be used.  Omitting the third argument (prefix
718           argument) implies optional whitespace at the start.  Omitting the
719           fourth argument (outermost delimiter brackets) indicates that the
720           value of the second argument is to be used for the outermost
721           delimiters.
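
           The defaults described above can be seen in a minimal sketch like
           the following. The sample string is made up for illustration; the
           call passes no arguments at all, so the input comes from $_, only
           '{' is treated as a delimiter, and leading whitespace is consumed
           as the prefix.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# Illustrative input only: two leading spaces, a nested block, trailing text.
local $_ = '  { a { b } } tail';

# No arguments: process $_, use '{' as the delimiter, skip leading whitespace.
my ($block, $remainder, $prefix) = extract_codeblock();

print "block:     [$block]\n";
print "remainder: [$remainder]\n";
print "prefix:    [$prefix]\n";
```

           In a list context the third returned value is the skipped prefix,
           here the two leading spaces.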
722
723           Once the prefix and the outermost opening delimiter bracket have
724           been recognized, code blocks are extracted by stepping through the
725           input text and trying the following alternatives in sequence:
726
727           1.  Try to match a closing delimiter bracket. If the bracket was
728               the same species as the last opening bracket, return the
729               substring to that point. If the bracket was mismatched, return
730               an error.
731
732           2.  Try to match a quote or quotelike operator. If found, call
733               "extract_quotelike" to eat it. If "extract_quotelike" fails,
734               return the error it returned. Otherwise go back to step 1.
735
736           3.  Try to match an opening delimiter bracket. If found, call
737               "extract_codeblock" recursively to eat the embedded block. If
738               the recursive call fails, return an error. Otherwise, go back
739               to step 1.
740
741           4.  Unconditionally match a bareword or any other single character,
742               and then go back to step 1.
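
           As a rough illustration of these steps, the sketch below feeds
           "extract_codeblock" a block whose quoted strings contain
           unbalanced brackets and which also contains a nested block. The
           input string is invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_codeblock);

# The "}" and "{" inside the double-quoted strings must not be counted
# as brackets; the inner { ... } is a genuinely nested block.
my $text = '{ warn "}"; { warn "{" } } and more';

my ($block, $remainder) = extract_codeblock($text, '{}');

if (defined $block) {
    print "block:     [$block]\n";
    print "remainder: [$remainder]\n";
}
else {
    print "extraction failed: $@\n";
}
```

           The quoted strings are consumed by the quotelike step, so only the
           real braces take part in the bracket matching.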
743
744           Examples:
745
746                   # Find a while loop in the text
747
748                           if ($text =~ s/.*?while\s*\{/{/)
749                           {
750                                   $loop = "while " . extract_codeblock($text);
751                           }
752
753                   # Remove the first round-bracketed list (which may include
754                   # round- or curly-bracketed code blocks or quotelike operators)
755
756                           extract_codeblock $text, "(){}", '[^(]*';
757
758           The ability to specify a different outermost delimiter bracket is
759           useful in some circumstances. For example, in the Parse::RecDescent
760           module, parser actions which are to be performed only on a
761           successful parse are specified using a "<defer:...>" directive. For
762           example:
763
764                   sentence: subject verb object
765                                   <defer: {$::theVerb = $item{verb}} >
766
767           Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to
768           extract the code within the "<defer:...>" directive, but there's a
769           problem.
770
771           A deferred action like this:
772
773                                   <defer: {if ($count>10) {$count--}} >
774
775           will be incorrectly parsed as:
776
777                                   <defer: {if ($count>
778
779           because the "greater than" operator is interpreted as a closing
780           delimiter.
781
782           But, by extracting the directive using
783           "extract_codeblock($text, '{}', undef, '<>')" the '>' character is
784           only treated as a delimiter at the outermost level of the code
785           block, so the directive is parsed correctly.
786
787       "extract_multiple"
788           The "extract_multiple" subroutine takes a string to be processed
789           and a list of extractors (subroutines or regular expressions) to
790           apply to that string.
791
792           In a list context "extract_multiple" returns a list of
793           substrings of the original string, as extracted by the specified
794           extractors.  In a scalar context, "extract_multiple" returns the
795           first substring successfully extracted from the original string. In
796           both scalar and void contexts the original string has the first
797           successfully extracted substring removed from it. In all contexts
798           "extract_multiple" starts at the current "pos" of the string, and
799           sets that "pos" appropriately after it matches.
800
801           Hence, the aim of a call to "extract_multiple" in a list context is
802           to split the processed string into as many non-overlapping fields
803           as possible, by repeatedly applying each of the specified
804           extractors to the remainder of the string. Thus "extract_multiple"
805           is a generalized form of Perl's "split" subroutine.
806
807           The subroutine takes up to four optional arguments:
808
809           1.  A string to be processed ($_ if the string is omitted or
810               "undef")
811
812           2.  A reference to a list of subroutine references and/or qr//
813               objects and/or literal strings and/or hash references,
814               specifying the extractors to be used to split the string. If
815               this argument is omitted (or "undef") the list:
816
817                       [
818                               sub { extract_variable($_[0], '') },
819                               sub { extract_quotelike($_[0],'') },
820                               sub { extract_codeblock($_[0],'{}','') },
821                       ]
822
823               is used.
824
825           3.  A number specifying the maximum number of fields to return. If
826               this argument is omitted (or "undef"), split continues as long
827               as possible.
828
829               If the third argument is N, then extraction continues until N
830               fields have been successfully extracted, or until the string
831               has been completely processed.
832
833               Note that in scalar and void contexts the value of this
834               argument is automatically reset to 1 (under "-w", a warning is
835               issued if the argument has to be reset).
836
837           4.  A value indicating whether unmatched substrings (see below)
838               within the text should be skipped or returned as fields. If the
839               value is true, such substrings are skipped. Otherwise, they are
840               returned.
841
842           The extraction process works by applying each extractor in sequence
843           to the text string.
844
845           If the extractor is a subroutine it is called in a list context and
846           is expected to return a list of a single element, namely the
847           extracted text. It may optionally also return two further
848           arguments: a string representing the text left after extraction
849           (like $' for a pattern match), and a string representing any prefix
850           skipped before the extraction (like $` in a pattern match). Note
851           that this is designed to facilitate the use of other Text::Balanced
852           subroutines with "extract_multiple". Note too that the value
853           returned by an extractor subroutine need not bear any relationship
854           to the corresponding substring of the original text (see examples
855           below).
856
857           If the extractor is a precompiled regular expression or a string,
858           it is matched against the text in a scalar context with a leading
859           '\G' and the gc modifiers enabled. The extracted value is either $1
860           if that variable is defined after the match, or else the complete
861           match (i.e. $&).
862
863           If the extractor is a hash reference, it must contain exactly one
864           element.  The value of that element is one of the above extractor
865           types (subroutine reference, regular expression, or string).  The
866           key of that element is the name of a class into which the
867           successful return value of the extractor will be blessed.
868
869           If an extractor returns a defined value, that value is immediately
870           treated as the next extracted field and pushed onto the list of
871           fields.  If the extractor was specified in a hash reference, the
872           field is also blessed into the appropriate class.
873
874           If the extractor fails to match (in the case of a regex extractor),
875           or returns an empty list or an undefined value (in the case of a
876           subroutine extractor), it is assumed to have failed to extract.  If
877           none of the extractor subroutines succeeds, then one character is
878           extracted from the start of the text and the extraction subroutines
879           reapplied. Characters which are thus removed are accumulated and
880           eventually become the next field (unless the fourth argument is
881           true, in which case they are discarded).
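
           The effect of the fourth argument on these accumulated characters
           can be sketched as follows. The input string and the single
           regex extractor are illustrative choices, not part of the module.
           Because "extract_multiple" starts at the current "pos" of the
           string, the sketch resets "pos" before the second call.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

my $text = 'alpha, beta; gamma';

# Fourth argument omitted: unmatched runs (", " and "; ") become fields.
my @with_gaps = extract_multiple($text, [ qr/[a-z]+/ ]);

# The first call left pos($text) past the processed text; start over.
pos($text) = 0;

# Fourth argument true: unmatched runs are discarded.
my @without_gaps = extract_multiple($text, [ qr/[a-z]+/ ], undef, 1);

print join('|', @with_gaps), "\n";
print join('|', @without_gaps), "\n";
```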
882
883           For example, the following extracts substrings that are valid Perl
884           variables:
885
886                   @fields = extract_multiple($text,
887                                              [ sub { extract_variable($_[0]) } ],
888                                              undef, 1);
889
890           This example separates a text into fields which are quote
891           delimited, curly bracketed, and anything else. The delimited and
892           bracketed parts are also blessed to identify them (the "anything
893           else" is unblessed):
894
895                   @fields = extract_multiple($text,
896                              [
897                                   { Delim => sub { extract_delimited($_[0],q{'"}) } },
898                                   { Brack => sub { extract_bracketed($_[0],'{}') } },
899                              ]);
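
           One way to consume such a mixed list is sketched below. It assumes
           that a blessed field is a reference to the extracted string (which
           is how the current implementation appears to behave, though the
           text above does not promise it), so the string is recovered by
           dereferencing. The sample input is made up.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text = q{say 'hi' {block} end};

my @fields = extract_multiple($text,
    [
        { Delim => sub { extract_delimited($_[0], q{'"}) } },
        { Brack => sub { extract_bracketed($_[0], '{}') } },
    ]);

for my $field (@fields) {
    if (ref $field) {
        # Assumption: the blessed field is a scalar reference to the text.
        printf "%-5s [%s]\n", ref($field), $$field;
    }
    else {
        printf "%-5s [%s]\n", 'plain', $field;
    }
}
```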
900
901           This call extracts the next single substring that is a valid Perl
902           quotelike operator (and removes it from $text):
903
904                   $quotelike = extract_multiple($text,
905                                                 [
906                                                   sub { extract_quotelike($_[0]) },
907                                                 ], undef, 1);
908
909           Finally, here is yet another way to do comma-separated value
910           parsing:
911
912                   $csv_text = "a,'x b',c";
913                   @fields = extract_multiple($csv_text,
914                                             [
915                                                   sub { extract_delimited($_[0],q{'"}) },
916                                                   qr/([^,]+)/,
917                                             ],
918                                             undef,1);
919                   # @fields is now ('a', "'x b'", 'c')
920
921           The list in the second argument means: "Try and extract a ' or "
922           delimited string, otherwise extract anything up to a comma...".
923           The undef third argument means: "...as many times as possible...",
924           and the true value in the fourth argument means "...discarding
925           anything else that appears (i.e. the commas)".
926
927           If you wanted the commas preserved as separate fields (i.e. like
928           split does if your split pattern has capturing parentheses), you
929           would just make the last parameter undefined (or remove it).
930
931       "gen_delimited_pat"
932           The "gen_delimited_pat" subroutine takes a single (string) argument
933           and builds a Friedl-style optimized regex that matches a string
934           delimited by any one of the characters in the single argument. For
935           example:
936
937                   gen_delimited_pat(q{'"})
938
939           returns the regex:
940
941                   (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
942
943           Note that the specified delimiters are automatically quotemeta'd.
944
945           A typical use of "gen_delimited_pat" would be to build special
946           purpose tags for "extract_tagged". For example, to properly ignore
947           "empty" XML elements (which might contain quoted strings):
948
949                   my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';
950
951                   extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );
952
953           "gen_delimited_pat" may also be called with an optional second
954           argument, which specifies the "escape" character(s) to be used for
955           each delimiter.  For example, to match a Pascal-style string (where
956           ' is the delimiter and '' is a literal ' within the string):
957
958                   gen_delimited_pat(q{'},q{'});
959
960           Different escape characters can be specified for different
961           delimiters.  For example, to specify that '/' is the escape for
962           single quotes and '%' is the escape for double quotes:
963
964                   gen_delimited_pat(q{'"},q{/%});
965
966           If more delimiters than escape chars are specified, the last escape
967           char is used for the remaining delimiters.  If no escape char is
968           specified for a given specified delimiter, '\' is used.
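
           The returned pattern can also be interpolated directly into a
           match. The following is a small sketch with invented sample
           strings; it relies only on what the pattern matches, not on its
           exact text, which may differ between versions of the module.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Default escape character ('\'): the escaped quote stays inside the match.
my $quoted = gen_delimited_pat(q{'"});
my $text   = q{say "a \" b" now};
my ($match) = $text =~ /($quoted)/;

# Pascal-style string: the delimiter doubles as its own escape.
my $pascal_pat = gen_delimited_pat(q{'}, q{'});
my $source     = q{x := 'it''s';};
my ($pascal)   = $source =~ /($pascal_pat)/;

print "quoted: $match\n";
print "pascal: $pascal\n";
```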
969
970       "delimited_pat"
971           Note that "gen_delimited_pat" was previously called
972           "delimited_pat".  That name may still be used, but is now
973           deprecated.
974

DIAGNOSTICS

976       In a list context, all the functions return "(undef,$original_text)" on
977       failure. In a scalar context, failure is indicated by returning "undef"
978       (in this case the input text is not modified in any way).
979
980       In addition, on failure in any context, the $@ variable is set.
981       Accessing "$@->{error}" returns one of the error diagnostics listed
982       below.  Accessing "$@->{pos}" returns the offset into the original
983       string at which the error was detected (although not necessarily where
984       it occurred!)  Printing $@ directly produces the error message, with
985       the offset appended.  On success, the $@ variable is guaranteed to be
986       "undef".
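
       A minimal sketch of inspecting a failure, using an invented input
       string that contains no bracket at all:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = 'no brackets here';
my ($extracted, $remainder) = extract_bracketed($text, '()');
my $failure = $@;    # copy it before a later call can overwrite $@

unless (defined $extracted) {
    print "error:  $failure->{error}\n";
    print "offset: $failure->{pos}\n";
    print "full:   $failure\n";      # message with the offset appended
}

# After a successful extraction $@ is undefined again.
my $ok_text = '(a) b';
my ($ok) = extract_bracketed($ok_text, '()');
print "after success: ", (defined $@ ? "set" : "undef"), "\n";
```

       Copying $@ straight away is a defensive habit rather than a
       requirement of the module.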
987
988       The available diagnostics are:
989
990       "Did not find a suitable bracket: "%s""
991           The delimiter provided to "extract_bracketed" was not one of
992           '()[]<>{}'.
993
994       "Did not find prefix: /%s/"
995           A non-optional prefix was specified but wasn't found at the start
996           of the text.
997
998       "Did not find opening bracket after prefix: "%s""
999           "extract_bracketed" or "extract_codeblock" was expecting a
1000           particular kind of bracket at the start of the text, and didn't
1001           find it.
1002
1003       "No quotelike operator found after prefix: "%s""
1004           "extract_quotelike" didn't find one of the quotelike operators "q",
1005           "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it
1006           was extracting.
1007
1008       "Unmatched closing bracket: "%c""
1009           "extract_bracketed", "extract_quotelike" or "extract_codeblock"
1010           encountered a closing bracket where none was expected.
1011
1012       "Unmatched opening bracket(s): "%s""
1013           "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran
1014           out of characters in the text before closing one or more levels of
1015           nested brackets.
1016
1017       "Unmatched embedded quote (%s)"
1018           "extract_bracketed" attempted to match an embedded quoted
1019           substring, but failed to find a closing quote to match it.
1020
1021       "Did not find closing delimiter to match '%s'"
1022           "extract_quotelike" was unable to find a closing delimiter to match
1023           the one that opened the quote-like operation.
1024
1025       "Mismatched closing bracket: expected "%c" but found "%s""
1026           "extract_bracketed", "extract_quotelike" or "extract_codeblock"
1027           found a valid bracket delimiter, but it was the wrong species. This
1028           usually indicates a nesting error, but may indicate incorrect
1029           quoting or escaping.
1030
1031       "No block delimiter found after quotelike "%s""
1032           "extract_quotelike" or "extract_codeblock" found one of the
1033           quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without
1034           a suitable block after it.
1035
1036       "Did not find leading dereferencer"
1037           "extract_variable" was expecting one of '$', '@', or '%' at the
1038           start of a variable, but didn't find any of them.
1039
1040       "Bad identifier after dereferencer"
1041           "extract_variable" found a '$', '@', or '%' indicating a variable,
1042           but that character was not followed by a legal Perl identifier.
1043
1044       "Did not find expected opening bracket at %s"
1045           "extract_codeblock" failed to find any of the outermost opening
1046           brackets that were specified.
1047
1048       "Improperly nested codeblock at %s"
1049           A nested code block was found that started with a delimiter that
1050           was specified as being only to be used as an outermost bracket.
1051
1052       "Missing second block for quotelike "%s""
1053           "extract_codeblock" or "extract_quotelike" found one of the
1054           quotelike operators "s", "tr" or "y" followed by only one block.
1055
1056       "No match found for opening bracket"
1057           "extract_codeblock" failed to find a closing bracket to match the
1058           outermost opening bracket.
1059
1060       "Did not find opening tag: /%s/"
1061           "extract_tagged" did not find a suitable opening tag (after any
1062           specified prefix was removed).
1063
1064       "Unable to construct closing tag to match: /%s/"
1065           "extract_tagged" matched the specified opening tag and tried to
1066           modify the matched text to produce a matching closing tag (because
1067           none was specified). It failed to generate the closing tag, almost
1068           certainly because the opening tag did not start with a bracket of
1069           some kind.
1070
1071       "Found invalid nested tag: %s"
1072           "extract_tagged" found a nested tag that appeared in the "reject"
1073           list (and the failure mode was not "MAX" or "PARA").
1074
1075       "Found unbalanced nested tag: %s"
1076           "extract_tagged" found a nested opening tag that was not matched by
1077           a corresponding nested closing tag (and the failure mode was not
1078           "MAX" or "PARA").
1079
1080       "Did not find closing tag"
1081           "extract_tagged" reached the end of the text without finding a
1082           closing tag to match the original opening tag (and the failure mode
1083           was not "MAX" or "PARA").
1084

EXPORTS

1086       The following symbols are, or can be, exported by this module:
1087
1088       Default Exports
1089           None.
1090
1091       Optional Exports
1092           "extract_delimited", "extract_bracketed", "extract_quotelike",
1093           "extract_codeblock", "extract_variable", "extract_tagged",
1094           "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",
1095           "delimited_pat".
1096
1097       Export Tags
1098           ":ALL"
1099               "extract_delimited", "extract_bracketed", "extract_quotelike",
1100               "extract_codeblock", "extract_variable", "extract_tagged",
1101               "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",
1102               "delimited_pat".
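
       Since nothing is exported by default, a caller either imports names
       explicitly, imports the ":ALL" tag, or uses fully qualified names. A
       short sketch (the package name Demo::All is hypothetical):

```perl
use strict;
use warnings;

# Import a single function by name.
use Text::Balanced qw(extract_bracketed);

my $text = '(a) b';
my ($inner, $rest) = extract_bracketed($text, '()');

# A fully qualified call needs no import at all.
my $other = '[c] d';
my ($inner2) = Text::Balanced::extract_bracketed($other, '[]');

# The :ALL tag imports every optional export into the current package.
{
    package Demo::All;    # hypothetical package, for illustration only
    use Text::Balanced qw(:ALL);
}

print "$inner $inner2\n";
```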
1103

KNOWN BUGS

1105       See
1106       <https://rt.cpan.org/Dist/Display.html?Status=Active&Queue=Text-Balanced>.
1107

FEEDBACK

1109       Patches, bug reports, suggestions or any other feedback is welcome.
1110
1111       Patches can be sent as GitHub pull requests at
1112       <https://github.com/steve-m-hay/Text-Balanced/pulls>.
1113
1114       Bug reports and suggestions can be made on the CPAN Request Tracker at
1115       <https://rt.cpan.org/Public/Bug/Report.html?Queue=Text-Balanced>.
1116
1117       Currently active requests on the CPAN Request Tracker can be viewed at
1118       <https://rt.cpan.org/Public/Dist/Display.html?Status=Active;Queue=Text-Balanced>.
1119
1120       Please test this distribution.  See CPAN Testers Reports at
1121       <https://www.cpantesters.org/> for details of how to get involved.
1122
1123       Previous test results on CPAN Testers Reports can be viewed at
1124       <https://www.cpantesters.org/distro/T/Text-Balanced.html>.
1125
1126       Please rate this distribution on CPAN Ratings at
1127       <https://cpanratings.perl.org/rate/?distribution=Text-Balanced>.
1128

AVAILABILITY

1130       The latest version of this module is available from CPAN (see "CPAN" in
1131       perlmodlib for details) at
1132
1133       <https://metacpan.org/release/Text-Balanced> or
1134
1135       <https://www.cpan.org/authors/id/S/SH/SHAY/> or
1136
1137       <https://www.cpan.org/modules/by-module/Text/>.
1138
1139       The latest source code is available from GitHub at
1140       <https://github.com/steve-m-hay/Text-Balanced>.
1141

INSTALLATION

1143       See the INSTALL file.
1144

AUTHOR

1146       Damian Conway <damian@conway.org>.
1147
1148       Steve Hay <shay@cpan.org> is now maintaining
1149       Text::Balanced as of version 2.03.
1150

COPYRIGHT

1152       Copyright (C) 1997-2001 Damian Conway.  All rights reserved.
1153
1154       Copyright (C) 2009 Adam Kennedy.
1155
1156       Copyright (C) 2015, 2020, 2022 Steve Hay and other contributors.  All
1157       rights reserved.
1158

LICENCE

1160       This module is free software; you can redistribute it and/or modify it
1161       under the same terms as Perl itself, i.e. under the terms of either the
1162       GNU General Public License or the Artistic License, as specified in the
1163       LICENCE file.
1164

VERSION

1166       Version 2.06
1167

DATE

1169       05 Jun 2022
1170

HISTORY

1172       See the Changes file.
1173
1174
1175
1176perl v5.36.0                      2023-03-09                 Text::Balanced(3)