1Text::Balanced(3) User Contributed Perl Documentation Text::Balanced(3)
2
3
4
NAME
6 Text::Balanced - Extract delimited text sequences from strings.
7
SYNOPSIS
9 use Text::Balanced qw (
10 extract_delimited
11 extract_bracketed
12 extract_quotelike
13 extract_codeblock
14 extract_variable
15 extract_tagged
16 extract_multiple
17 gen_delimited_pat
18 gen_extract_tagged
19 );
20
21 # Extract the initial substring of $text that is delimited by
22 # two (unescaped) instances of the first character in $delim.
23
24 ($extracted, $remainder) = extract_delimited($text,$delim);
25
26 # Extract the initial substring of $text that is bracketed
27 # with a delimiter(s) specified by $delim (where the string
28 # in $delim contains one or more of '(){}[]<>').
29
30 ($extracted, $remainder) = extract_bracketed($text,$delim);
31
32 # Extract the initial substring of $text that is bounded by
33 # an XML tag.
34
35 ($extracted, $remainder) = extract_tagged($text);
36
# Extract the initial substring of $text that is bounded by
# a BEGIN...END pair. Don't allow nested BEGIN tags.

($extracted, $remainder) =
    extract_tagged($text,"BEGIN","END",undef,{reject=>["BEGIN"]});
42
43 # Extract the initial substring of $text that represents a
44 # Perl "quote or quote-like operation"
45
46 ($extracted, $remainder) = extract_quotelike($text);
47
48 # Extract the initial substring of $text that represents a block
49 # of Perl code, bracketed by any of character(s) specified by $delim
50 # (where the string $delim contains one or more of '(){}[]<>').
51
52 ($extracted, $remainder) = extract_codeblock($text,$delim);
53
54 # Extract the initial substrings of $text that would be extracted by
55 # one or more sequential applications of the specified functions
56 # or regular expressions
57
58 @extracted = extract_multiple($text,
59 [ \&extract_bracketed,
60 \&extract_quotelike,
61 \&some_other_extractor_sub,
62 qr/[xyz]*/,
63 'literal',
64 ]);
65
66 # Create a string representing an optimized pattern (a la Friedl)
67 # that matches a substring delimited by any of the specified characters
68 # (in this case: any type of quote or a slash)
69
70 $patstring = gen_delimited_pat(q{'"`/});
71
# Generate a reference to an anonymous sub that is just like extract_tagged
# but pre-compiled and optimized for a specific pair of tags, and
# consequently faster (roughly two to three times). It uses qr// for
# better performance on repeated calls.
76
77 $extract_head = gen_extract_tagged('<HEAD>','</HEAD>');
78 ($extracted, $remainder) = $extract_head->($text);
79
DESCRIPTION
81 The various "extract_..." subroutines may be used to extract a
82 delimited substring, possibly after skipping a specified prefix string.
83 By default, that prefix is optional whitespace ("/\s*/"), but you can
84 change it to whatever you wish (see below).
85
86 The substring to be extracted must appear at the current "pos" location
87 of the string's variable (or at index zero, if no "pos" position is
88 defined). In other words, the "extract_..." subroutines don't extract
89 the first occurrence of a substring anywhere in a string (like an
90 unanchored regex would). Rather, they extract an occurrence of the
91 substring appearing immediately at the current matching position in the
92 string (like a "\G"-anchored regex would).
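The anchoring rule above can be illustrated with a minimal sketch. The sample string and variable names are illustrative assumptions, not part of the module:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "foo (bar) baz";

# No pos() is set, so matching starts at index 0, where "f" is not a bracket.
my ($miss) = extract_bracketed($text, '()');

# Point pos() at the "(" (index 4) and try again.
pos($text) = 4;
my ($hit, $rest) = extract_bracketed($text, '()');

print defined $miss ? "first call:  $miss\n" : "first call:  no match\n";
print "second call: $hit (remainder: '$rest')\n";
```

To find a bracketed substring anywhere in a string, either set "pos" first (as here) or supply a prefix pattern that skips the leading text.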
93
94 General Behaviour in List Contexts
95 In a list context, all the subroutines return a list, the first three
96 elements of which are always:
97
98 [0] The extracted string, including the specified delimiters. If the
99 extraction fails "undef" is returned.
100
101 [1] The remainder of the input string (i.e. the characters after the
102 extracted string). On failure, the entire string is returned.
103
104 [2] The skipped prefix (i.e. the characters before the extracted
105 string). On failure, "undef" is returned.
106
107 Note that in a list context, the contents of the original input text
108 (the first argument) are not modified in any way.
109
110 However, if the input text was passed in a variable, that variable's
111 "pos" value is updated to point at the first character after the
112 extracted text. That means that in a list context the various
113 subroutines can be used much like regular expressions. For example:
114
115 while ( $next = (extract_quotelike($text))[0] )
116 {
117 # process next quote-like (in $next)
118 }
119
120 General Behaviour in Scalar and Void Contexts
121 In a scalar context, the extracted string is returned, having first
122 been removed from the input text. Thus, the following code also
123 processes each quote-like operation, but actually removes them from
124 $text:
125
126 while ( $next = extract_quotelike($text) )
127 {
128 # process next quote-like (in $next)
129 }
130
131 Note that if the input text is a read-only string (i.e. a literal), no
132 attempt is made to remove the extracted text.
133
134 In a void context the behaviour of the extraction subroutines is
135 exactly the same as in a scalar context, except (of course) that the
136 extracted substring is not returned.
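The difference between the list-context and scalar-context behaviour can be sketched as follows. The input string and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $original    = q{"a" 'b' q{c} tail};
my $list_text   = $original;
my $scalar_text = $original;

# List context: $list_text is left intact; only its pos() advances.
my @seen;
while (defined(my $next = (extract_quotelike($list_text))[0])) {
    push @seen, $next;
}

# Scalar context: each extracted item (and its prefix) is cut out of the string.
my @cut;
while (defined(my $next = extract_quotelike($scalar_text))) {
    push @cut, $next;
}

print "list context found:   @seen\n";
print "list text is now:     [$list_text]\n";
print "scalar context found: @cut\n";
print "scalar text is now:   [$scalar_text]\n";
```

Testing the result with "defined" rather than for truth is a design choice in this sketch; it keeps the loop from stopping early on an extracted string that happens to be false, such as "0".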
137
138 A Note About Prefixes
139 Prefix patterns are matched without any trailing modifiers ("/gimsox"
140 etc.) This can bite you if you're expecting a prefix specification
141 like '.*?(?=<H1>)' to skip everything up to the first <H1> tag. Such a
142 prefix pattern will only succeed if the <H1> tag is on the current
143 line, since . normally doesn't match newlines.
144
145 To overcome this limitation, you need to turn on /s matching within the
146 prefix pattern, using the "(?s)" directive: '(?s).*?(?=<H1>)'
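A minimal sketch of the effect of "(?s)" in a prefix pattern, using "extract_tagged" with explicit <H1> tags. The sample text and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "intro line\nmore intro\n<H1>Title</H1> after";

# Without (?s), "." cannot cross the newlines, so the prefix never reaches <H1>.
my ($plain) = extract_tagged($text, '<H1>', '</H1>', '.*?(?=<H1>)');

# Start again from the beginning, this time letting "." match newlines.
pos($text) = 0;
my ($dotall, $rest, $prefix) =
    extract_tagged($text, '<H1>', '</H1>', '(?s).*?(?=<H1>)');

print defined $plain ? "plain:  $plain\n" : "plain:  no match\n";
print "dotall: $dotall\n";
```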
147
148 Functions
149 "extract_delimited"
150 The "extract_delimited" function formalizes the common idiom of
151 extracting a single-character-delimited substring from the start of
152 a string. For example, to extract a single-quote delimited string,
153 the following code is typically used:
154
155 ($remainder = $text) =~ s/\A('(\\.|[^'])*')//s;
156 $extracted = $1;
157
158 but with "extract_delimited" it can be simplified to:
159
160 ($extracted,$remainder) = extract_delimited($text, "'");
161
162 "extract_delimited" takes up to four scalars (the input text, the
163 delimiters, a prefix pattern to be skipped, and any escape
164 characters) and extracts the initial substring of the text that is
165 appropriately delimited. If the delimiter string has multiple
166 characters, the first one encountered in the text is taken to
167 delimit the substring. The third argument specifies a prefix
168 pattern that is to be skipped (but must be present!) before the
169 substring is extracted. The final argument specifies the escape
170 character to be used for each delimiter.
171
172 All arguments are optional. If the escape characters are not
173 specified, every delimiter is escaped with a backslash ("\"). If
174 the prefix is not specified, the pattern '\s*' - optional
175 whitespace - is used. If the delimiter set is also not specified,
176 the set "/["'`]/" is used. If the text to be processed is not
177 specified either, $_ is used.
178
In list context, "extract_delimited" returns a list of three
elements: the extracted substring (including the surrounding
delimiters), the remainder of the text, and the skipped prefix (if
any). If a suitable delimited substring is not found, the first
element of the list is "undef", the second is the
complete original text, and the prefix returned in the third
element is an empty string.
186
187 In a scalar context, just the extracted substring is returned. In a
188 void context, the extracted substring (and any prefix) are simply
189 removed from the beginning of the first argument.
190
191 Examples:
192
193 # Remove a single-quoted substring from the very beginning of $text:
194
195 $substring = extract_delimited($text, "'", '');
196
197 # Remove a single-quoted Pascalish substring (i.e. one in which
198 # doubling the quote character escapes it) from the very
199 # beginning of $text:
200
201 $substring = extract_delimited($text, "'", '', "'");
202
203 # Extract a single- or double- quoted substring from the
204 # beginning of $text, optionally after some whitespace
205 # (note the list context to protect $text from modification):
206
207 ($substring) = extract_delimited $text, q{"'};
208
209 # Delete the substring delimited by the first '/' in $text:
210
    $text = join '', (extract_delimited($text,'/','[^/]*'))[2,1];
212
213 Note that this last example is not the same as deleting the first
214 quote-like pattern. For instance, if $text contained the string:
215
216 "if ('./cmd' =~ m/$UNIXCMD/s) { $cmd = $1; }"
217
218 then after the deletion it would contain:
219
220 "if ('.$UNIXCMD/s) { $cmd = $1; }"
221
222 not:
223
224 "if ('./cmd' =~ ms) { $cmd = $1; }"
225
226 See "extract_quotelike" for a (partial) solution to this problem.
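The escape-character argument of "extract_delimited" can be sketched with two small inputs, one using the default backslash escape and one using Pascal-style quote doubling. The strings and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

my $c_style      = q{"say \"hi\"" then stop};
my $pascal_style = q{'it''s' then stop};

# Default escape character: a backslash protects the embedded double quotes.
my ($c_str, $c_rest) = extract_delimited($c_style, '"');

# Escape character equal to the delimiter: a doubled quote is an escaped quote.
my ($p_str, $p_rest) = extract_delimited($pascal_style, "'", '', "'");

print "C style:      $c_str\n";
print "Pascal style: $p_str\n";
```

Both calls are made in list context, so the input variables themselves are left unmodified.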
227
228 "extract_bracketed"
229 Like "extract_delimited", the "extract_bracketed" function takes up
230 to three optional scalar arguments: a string to extract from, a
231 delimiter specifier, and a prefix pattern. As before, a missing
232 prefix defaults to optional whitespace and a missing text defaults
233 to $_. However, a missing delimiter specifier defaults to
234 '{}()[]<>' (see below).
235
236 "extract_bracketed" extracts a balanced-bracket-delimited substring
237 (using any one (or more) of the user-specified delimiter brackets:
238 '(..)', '{..}', '[..]', or '<..>'). Optionally it will also respect
239 quoted unbalanced brackets (see below).
240
A "delimiter bracket" is a bracket in the list of delimiters passed as
242 "extract_bracketed"'s second argument. Delimiter brackets are
243 specified by giving either the left or right (or both!) versions of
244 the required bracket(s). Note that the order in which two or more
245 delimiter brackets are specified is not significant.
246
247 A "balanced-bracket-delimited substring" is a substring bounded by
248 matched brackets, such that any other (left or right) delimiter
249 bracket within the substring is also matched by an opposite (right
250 or left) delimiter bracket at the same level of nesting. Any type
251 of bracket not in the delimiter list is treated as an ordinary
252 character.
253
254 In other words, each type of bracket specified as a delimiter must
255 be balanced and correctly nested within the substring, and any
256 other kind of ("non-delimiter") bracket in the substring is
257 ignored.
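A short sketch of the distinction between delimiter and non-delimiter brackets. The input string and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "{ a [ b } rest";

# Only curly brackets are delimiters, so the stray "[" is an ordinary character.
my ($only_curly) = extract_bracketed($text, '{}');

# Reset the match position, then treat square brackets as delimiters too:
# the "[" is now left open when "}" arrives, so the extraction fails.
pos($text) = 0;
my ($curly_and_square) = extract_bracketed($text, '{}[]');

print "curly only:       $only_curly\n";
print "curly and square: ",
      (defined $curly_and_square ? $curly_and_square : 'no match'), "\n";
```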
258
259 For example, given the string:
260
261 $text = "{ an '[irregularly :-(] {} parenthesized >:-)' string }";
262
263 then a call to "extract_bracketed" in a list context:
264
265 @result = extract_bracketed( $text, '{}' );
266
267 would return:
268
269 ( "{ an '[irregularly :-(] {} parenthesized >:-)' string }" , "" , "" )
270
271 since both sets of '{..}' brackets are properly nested and evenly
272 balanced. (In a scalar context just the first element of the array
273 would be returned. In a void context, $text would be replaced by an
274 empty string.)
275
276 Likewise the call in:
277
278 @result = extract_bracketed( $text, '{[' );
279
280 would return the same result, since all sets of both types of
281 specified delimiter brackets are correctly nested and balanced.
282
283 However, the call in:
284
285 @result = extract_bracketed( $text, '{([<' );
286
287 would fail, returning:
288
289 ( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" );
290
because the embedded pairs of '(..)'s and '[..]'s are "cross-nested"
and the embedded '>' is unbalanced. (In a scalar context,
this call would return "undef". In a void context, $text
would be unchanged.)
295
296 Note that the embedded single-quotes in the string don't help in
297 this case, since they have not been specified as acceptable
298 delimiters and are therefore treated as non-delimiter characters
299 (and ignored).
300
However, if a particular species of quote character is included in
the delimiter specification, then that type of quote will be
correctly handled. For example, if $text is:
304
305 $text = '<A HREF=">>>>">link</A>';
306
307 then
308
309 @result = extract_bracketed( $text, '<">' );
310
311 returns:
312
313 ( '<A HREF=">>>>">', 'link</A>', "" )
314
as expected. Without the specification of '"' as an embedded
quoter:
317
318 @result = extract_bracketed( $text, '<>' );
319
320 the result would be:
321
322 ( '<A HREF=">', '>>>">link</A>', "" )
323
In addition to the quote delimiters "'", '"', and "`", full Perl
quote-like quoting (i.e. q{string}, qq{string}, etc.) can be
specified by including the letter 'q' as a delimiter. Hence:
327
328 @result = extract_bracketed( $text, '<q>' );
329
330 would correctly match something like this:
331
332 $text = '<leftop: conj /and/ conj>';
333
334 See also: "extract_quotelike" and "extract_codeblock".
335
336 "extract_variable"
337 "extract_variable" extracts any valid Perl variable or variable-
338 involved expression, including scalars, arrays, hashes, array
339 accesses, hash look-ups, method calls through objects, subroutine
340 calls through subroutine references, etc.
341
342 The subroutine takes up to two optional arguments:
343
344 1. A string to be processed ($_ if the string is omitted or
345 "undef")
346
347 2. A string specifying a pattern to be matched as a prefix (which
348 is to be skipped). If omitted, optional whitespace is skipped.
349
350 On success in a list context, an array of 3 elements is returned.
351 The elements are:
352
353 [0] the extracted variable, or variablish expression
354
355 [1] the remainder of the input text,
356
357 [2] the prefix substring (if any),
358
359 On failure, all of these values (except the remaining text) are
360 "undef".
361
362 In a scalar context, "extract_variable" returns just the complete
363 substring that matched a variablish expression. "undef" is returned
364 on failure. In addition, the original input text has the returned
365 substring (and any prefix) removed from it.
366
367 In a void context, the input text just has the matched substring
368 (and any specified prefix) removed.
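A minimal sketch of "extract_variable" in list context on three small inputs. The sample strings and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable);

my @results;
for my $source ('$name=1', '$data{key};', 'plain words') {
    my $text = $source;    # work on a copy of the sample string
    my ($var, $rest) = extract_variable($text);
    push @results, $var;
    printf "%-14s => %s\n", $source, defined $var ? $var : '(no match)';
}
```

The third input does not begin with a variable, so the first element of the returned list is undefined for it.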
369
370 "extract_tagged"
371 "extract_tagged" extracts and segments text between (balanced)
372 specified tags.
373
374 The subroutine takes up to five optional arguments:
375
376 1. A string to be processed ($_ if the string is omitted or
377 "undef")
378
379 2. A string specifying a pattern to be matched as the opening tag.
380 If the pattern string is omitted (or "undef") then a pattern
381 that matches any standard XML tag is used.
382
383 3. A string specifying a pattern to be matched at the closing tag.
384 If the pattern string is omitted (or "undef") then the closing
385 tag is constructed by inserting a "/" after any leading bracket
386 characters in the actual opening tag that was matched (not the
387 pattern that matched the tag). For example, if the opening tag
388 pattern is specified as '{{\w+}}' and actually matched the
389 opening tag "{{DATA}}", then the constructed closing tag would
390 be "{{/DATA}}".
391
392 4. A string specifying a pattern to be matched as a prefix (which
393 is to be skipped). If omitted, optional whitespace is skipped.
394
395 5. A hash reference containing various parsing options (see below)
396
397 The various options that can be specified are:
398
399 "reject => $listref"
400 The list reference contains one or more strings specifying
401 patterns that must not appear within the tagged text.
402
403 For example, to extract an HTML link (which should not contain
404 nested links) use:
405
406 extract_tagged($text, '<A>', '</A>', undef, {reject => ['<A>']} );
407
408 "ignore => $listref"
409 The list reference contains one or more strings specifying
410 patterns that are not to be treated as nested tags within the
411 tagged text (even if they would match the start tag pattern).
412
413 For example, to extract an arbitrary XML tag, but ignore
414 "empty" elements:
415
416 extract_tagged($text, undef, undef, undef, {ignore => ['<[^>]*/>']} );
417
418 (also see "gen_delimited_pat" below).
419
420 "fail => $str"
421 The "fail" option indicates the action to be taken if a
422 matching end tag is not encountered (i.e. before the end of the
423 string or some "reject" pattern matches). By default, a failure
424 to match a closing tag causes "extract_tagged" to immediately
425 fail.
426
However, if the string value associated with "fail" is "MAX",
428 then "extract_tagged" returns the complete text up to the point
429 of failure. If the string is "PARA", "extract_tagged" returns
430 only the first paragraph after the tag (up to the first line
431 that is either empty or contains only whitespace characters).
432 If the string is "", the default behaviour (i.e. failure) is
433 reinstated.
434
435 For example, suppose the start tag "/para" introduces a
436 paragraph, which then continues until the next "/endpara" tag
437 or until another "/para" tag is encountered:
438
439 $text = "/para line 1\n\nline 3\n/para line 4";
440
    extract_tagged($text, '/para', '/endpara', undef,
                   {reject => ['/para'], fail => 'MAX'} );
443
444 # EXTRACTED: "/para line 1\n\nline 3\n"
445
446 Suppose instead, that if no matching "/endpara" tag is found,
447 the "/para" tag refers only to the immediately following
448 paragraph:
449
450 $text = "/para line 1\n\nline 3\n/para line 4";
451
    extract_tagged($text, '/para', '/endpara', undef,
                   {reject => ['/para'], fail => 'PARA'} );
454
455 # EXTRACTED: "/para line 1\n"
456
457 Note that the specified "fail" behaviour applies to nested tags
458 as well.
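The "reject" and "fail" options can be combined as in the following sketch, which uses the "/para" text from above. Passing "reject" as an array reference and quoting the "fail" value are assumptions made here so that the snippet runs under "use strict":

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = "/para line 1\n\nline 3\n/para line 4";

# No "/endpara" is present, so extraction runs until the rejected
# "/para" pattern is met; fail => 'MAX' keeps everything up to that point.
my ($extracted, $remainder) =
    extract_tagged($text, '/para', '/endpara', undef,
                   { reject => ['/para'], fail => 'MAX' });

print "extracted: [$extracted]\n";
print "remainder: [$remainder]\n";
```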
459
460 On success in a list context, an array of 6 elements is returned.
461 The elements are:
462
463 [0] the extracted tagged substring (including the outermost tags),
464
465 [1] the remainder of the input text,
466
467 [2] the prefix substring (if any),
468
469 [3] the opening tag
470
471 [4] the text between the opening and closing tags
472
473 [5] the closing tag (or "" if no closing tag was found)
474
475 On failure, all of these values (except the remaining text) are
476 "undef".
477
478 In a scalar context, "extract_tagged" returns just the complete
479 substring that matched a tagged text (including the start and end
480 tags). "undef" is returned on failure. In addition, the original
481 input text has the returned substring (and any prefix) removed from
482 it.
483
484 In a void context, the input text just has the matched substring
485 (and any specified prefix) removed.
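The six list-context return values can be sketched on a small nested fragment, using the default XML tag patterns. The input string and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_tagged);

my $text = '  <b>bold <i>nested</i></b> tail';

my ($whole, $remainder, $prefix, $open, $inner, $close) =
    extract_tagged($text);

print "whole:     [$whole]\n";
print "remainder: [$remainder]\n";
print "prefix:    [$prefix]\n";
print "open tag:  [$open]\n";
print "inner:     [$inner]\n";
print "close tag: [$close]\n";
```

The nested <i>...</i> pair is balanced, so it is carried along inside the text of the outer <b> element.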
486
487 "gen_extract_tagged"
"gen_extract_tagged" generates a new anonymous subroutine which
extracts text between (balanced) specified tags. In other words, it
generates a subroutine that behaves identically to "extract_tagged".
491
492 The difference between "extract_tagged" and the anonymous
493 subroutines generated by "gen_extract_tagged", is that those
494 generated subroutines:
495
496 • do not have to reparse tag specification or parsing options
497 every time they are called (whereas "extract_tagged" has to
498 effectively rebuild its tag parser on every call);
499
500 • make use of the new qr// construct to pre-compile the regexes
501 they use (whereas "extract_tagged" uses standard string
502 variable interpolation to create tag-matching patterns).
503
504 The subroutine takes up to four optional arguments (the same set as
505 "extract_tagged" except for the string to be processed). It returns
506 a reference to a subroutine which in turn takes a single argument
507 (the text to be extracted from).
508
509 In other words, the implementation of "extract_tagged" is exactly
510 equivalent to:
511
    sub extract_tagged
    {
        my $text = shift;
        my $extractor = gen_extract_tagged(@_);
        return $extractor->($text);
    }
518
519 (although "extract_tagged" is not currently implemented that way).
520
521 Using "gen_extract_tagged" to create extraction functions for
522 specific tags is a good idea if those functions are going to be
523 called more than once, since their performance is typically twice
524 as good as the more general-purpose "extract_tagged".
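A minimal sketch of building an extractor once and reusing it on several inputs. The tag pair, sample strings, and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_extract_tagged);

# Build the extractor once; the tag patterns are compiled at this point.
my $extract_bold = gen_extract_tagged('<b>', '</b>');

my @inner;
for my $source ('<b>one</b> x', '<b>two</b> y') {
    my $text = $source;    # work on a copy of the sample string
    # Element [4] of the returned list is the text between the tags.
    my @parts = $extract_bold->($text);
    push @inner, $parts[4];
}

print "inner texts: @inner\n";
```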
525
526 "extract_quotelike"
"extract_quotelike" attempts to recognize, extract, and segment any
one of the various Perl quotes and quotelike operators (see
perlop(3)). Nested backslashed delimiters, embedded balanced bracket
delimiters (for the quotelike operators), and trailing modifiers
are all caught. For example, in:
532
533 extract_quotelike 'q # an octothorpe: \# (not the end of the q!) #'
534
535 extract_quotelike ' "You said, \"Use sed\"." '
536
537 extract_quotelike ' s{([A-Z]{1,8}\.[A-Z]{3})} /\L$1\E/; '
538
539 extract_quotelike ' tr/\\\/\\\\/\\\//ds; '
540
541 the full Perl quotelike operations are all extracted correctly.
542
543 Note too that, when using the /x modifier on a regex, any comment
544 containing the current pattern delimiter will cause the regex to be
545 immediately terminated. In other words:
546
547 'm /
548 (?i) # CASE INSENSITIVE
549 [a-z_] # LEADING ALPHABETIC/UNDERSCORE
550 [a-z0-9]* # FOLLOWED BY ANY NUMBER OF ALPHANUMERICS
551 /x'
552
553 will be extracted as if it were:
554
555 'm /
556 (?i) # CASE INSENSITIVE
557 [a-z_] # LEADING ALPHABETIC/'
558
559 This behaviour is identical to that of the actual compiler.
560
561 "extract_quotelike" takes two arguments: the text to be processed
562 and a prefix to be matched at the very beginning of the text. If no
563 prefix is specified, optional whitespace is the default. If no text
564 is given, $_ is used.
565
566 In a list context, an array of 11 elements is returned. The
567 elements are:
568
569 [0] the extracted quotelike substring (including trailing
570 modifiers),
571
572 [1] the remainder of the input text,
573
574 [2] the prefix substring (if any),
575
576 [3] the name of the quotelike operator (if any),
577
578 [4] the left delimiter of the first block of the operation,
579
580 [5] the text of the first block of the operation (that is, the
581 contents of a quote, the regex of a match or substitution or
582 the target list of a translation),
583
584 [6] the right delimiter of the first block of the operation,
585
586 [7] the left delimiter of the second block of the operation (that
587 is, if it is a "s", "tr", or "y"),
588
589 [8] the text of the second block of the operation (that is, the
590 replacement of a substitution or the translation list of a
591 translation),
592
593 [9] the right delimiter of the second block of the operation (if
594 any),
595
[10] the trailing modifiers on the operation (if any).
598
599 For each of the fields marked "(if any)" the default value on
600 success is an empty string. On failure, all of these values
601 (except the remaining text) are "undef".
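A sketch of how the list-context fields line up for a substitution. The input string and the hash keys are illustrative assumptions; only a subset of the eleven fields is shown:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $text  = 's/foo/bar/gi; rest';
my @field = extract_quotelike($text);

# Name a few of the positional fields described above.
my %part = (
    whole     => $field[0],
    remainder => $field[1],
    operator  => $field[3],
    pattern   => $field[5],
    replace   => $field[8],
    modifiers => $field[10],
);

print "$_: [$part{$_}]\n" for sort keys %part;
```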
602
603 In a scalar context, "extract_quotelike" returns just the complete
604 substring that matched a quotelike operation (or "undef" on
605 failure). In a scalar or void context, the input text has the same
606 substring (and any specified prefix) removed.
607
608 Examples:
609
# Remove the first quotelike literal that appears in $text
611
612 $quotelike = extract_quotelike($text,'.*?');
613
614 # Replace one or more leading whitespace-separated quotelike
615 # literals in $_ with "<QLL>"
616
617 do { $_ = join '<QLL>', (extract_quotelike)[2,1] } until $@;
618
619
620 # Isolate the search pattern in a quotelike operation from $text
621
622 ($op,$pat) = (extract_quotelike $text)[3,5];
623 if ($op =~ /[ms]/)
624 {
625 print "search pattern: $pat\n";
626 }
627 else
628 {
629 print "$op is not a pattern matching operation\n";
630 }
631
"extract_quotelike" and "here documents"
633 "extract_quotelike" can successfully extract "here documents" from
634 an input string, but with an important caveat in list contexts.
635
Unlike other types of quote-like literals, a here document is
rarely a contiguous substring. For example, a typical piece of code
using a here document might look like this:
639
640 <<'EOMSG' || die;
641 This is the message.
642 EOMSG
643 exit;
644
645 Given this as an input string in a scalar context,
646 "extract_quotelike" would correctly return the string
647 "<<'EOMSG'\nThis is the message.\nEOMSG", leaving the string " ||
648 die;\nexit;" in the original variable. In other words, the two
649 separate pieces of the here document are successfully extracted and
650 concatenated.
651
652 In a list context, "extract_quotelike" would return the list
653
654 [0] "<<'EOMSG'\nThis is the message.\nEOMSG\n" (i.e. the full
655 extracted here document, including fore and aft delimiters),
656
657 [1] " || die;\nexit;" (i.e. the remainder of the input text,
658 concatenated),
659
660 [2] "" (i.e. the prefix substring -- trivial in this case),
661
662 [3] "<<" (i.e. the "name" of the quotelike operator)
663
664 [4] "'EOMSG'" (i.e. the left delimiter of the here document,
665 including any quotes),
666
667 [5] "This is the message.\n" (i.e. the text of the here document),
668
669 [6] "EOMSG" (i.e. the right delimiter of the here document),
670
[7..10] "" (a here document has no second left delimiter, second text,
second right delimiter, or trailing modifiers).
674
675 However, the matching position of the input variable would be set
676 to "exit;" (i.e. after the closing delimiter of the here document),
677 which would cause the earlier " || die;\nexit;" to be skipped in
678 any sequence of code fragment extractions.
679
680 To avoid this problem, when it encounters a here document whilst
681 extracting from a modifiable string, "extract_quotelike" silently
682 rearranges the string to an equivalent piece of Perl:
683
684 <<'EOMSG'
685 This is the message.
686 EOMSG
687 || die;
688 exit;
689
690 in which the here document is contiguous. It still leaves the
691 matching position after the here document, but now the rest of the
692 line on which the here document starts is not skipped.
693
To prevent "extract_quotelike" from modifying the input in
this way (this is the only case where a list-context
"extract_quotelike" does so), you can pass the input variable as an
interpolated literal:
698
699 $quotelike = extract_quotelike("$var");
700
701 "extract_codeblock"
702 "extract_codeblock" attempts to recognize and extract a balanced
703 bracket delimited substring that may contain unbalanced brackets
704 inside Perl quotes or quotelike operations. That is,
705 "extract_codeblock" is like a combination of "extract_bracketed"
706 and "extract_quotelike".
707
708 "extract_codeblock" takes the same initial three parameters as
709 "extract_bracketed": a text to process, a set of delimiter brackets
710 to look for, and a prefix to match first. It also takes an optional
711 fourth parameter, which allows the outermost delimiter brackets to
712 be specified separately (see below).
713
714 Omitting the first argument (input text) means process $_ instead.
715 Omitting the second argument (delimiter brackets) indicates that
716 only '{' is to be used. Omitting the third argument (prefix
717 argument) implies optional whitespace at the start. Omitting the
718 fourth argument (outermost delimiter brackets) indicates that the
719 value of the second argument is to be used for the outermost
720 delimiters.
721
722 Once the prefix and the outermost opening delimiter bracket have
723 been recognized, code blocks are extracted by stepping through the
724 input text and trying the following alternatives in sequence:
725
1. Try to match a closing delimiter bracket. If the bracket was
727 the same species as the last opening bracket, return the
728 substring to that point. If the bracket was mismatched, return
729 an error.
730
731 2. Try to match a quote or quotelike operator. If found, call
732 "extract_quotelike" to eat it. If "extract_quotelike" fails,
733 return the error it returned. Otherwise go back to step 1.
734
735 3. Try to match an opening delimiter bracket. If found, call
736 "extract_codeblock" recursively to eat the embedded block. If
737 the recursive call fails, return an error. Otherwise, go back
738 to step 1.
739
740 4. Unconditionally match a bareword or any other single character,
741 and then go back to step 1.
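The practical effect of the steps above can be sketched by comparing "extract_codeblock" with "extract_bracketed" on a block whose string literal contains an unbalanced brace. The sample code string and variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed extract_codeblock);

# A block whose string literal contains an unbalanced closing brace.
my $source = '{ my $s = "}"; print $s } trailing';

# extract_bracketed only counts brackets, so it stops at the quoted "}".
my $for_brackets = $source;
my ($naive) = extract_bracketed($for_brackets, '{}');

# extract_codeblock recognises the string literal and steps over it.
my $for_code = $source;
my ($block, $rest) = extract_codeblock($for_code, '{}');

print "extract_bracketed: $naive\n";
print "extract_codeblock: $block\n";
```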
742
743 Examples:
744
745 # Find a while loop in the text
746
747 if ($text =~ s/.*?while\s*\{/{/)
748 {
749 $loop = "while " . extract_codeblock($text);
750 }
751
752 # Remove the first round-bracketed list (which may include
753 # round- or curly-bracketed code blocks or quotelike operators)
754
755 extract_codeblock $text, "(){}", '[^(]*';
756
757 The ability to specify a different outermost delimiter bracket is
758 useful in some circumstances. For example, in the Parse::RecDescent
759 module, parser actions which are to be performed only on a
760 successful parse are specified using a "<defer:...>" directive. For
761 example:
762
763 sentence: subject verb object
764 <defer: {$::theVerb = $item{verb}} >
765
766 Parse::RecDescent uses "extract_codeblock($text, '{}<>')" to
767 extract the code within the "<defer:...>" directive, but there's a
768 problem.
769
770 A deferred action like this:
771
772 <defer: {if ($count>10) {$count--}} >
773
774 will be incorrectly parsed as:
775
776 <defer: {if ($count>
777
because the "greater than" operator is interpreted as a closing
delimiter.
780
But, by extracting the directive using
"extract_codeblock($text, '{}', undef, '<>')" the '>' character is
only treated as a delimiter at the outermost level of the code
block, so the directive is parsed correctly.
785
786 "extract_multiple"
787 The "extract_multiple" subroutine takes a string to be processed
788 and a list of extractors (subroutines or regular expressions) to
789 apply to that string.
790
In a list context "extract_multiple" returns a list of
792 substrings of the original string, as extracted by the specified
793 extractors. In a scalar context, "extract_multiple" returns the
794 first substring successfully extracted from the original string. In
795 both scalar and void contexts the original string has the first
796 successfully extracted substring removed from it. In all contexts
797 "extract_multiple" starts at the current "pos" of the string, and
798 sets that "pos" appropriately after it matches.
799
800 Hence, the aim of a call to "extract_multiple" in a list context is
801 to split the processed string into as many non-overlapping fields
802 as possible, by repeatedly applying each of the specified
803 extractors to the remainder of the string. Thus "extract_multiple"
804 is a generalized form of Perl's "split" subroutine.
805
806 The subroutine takes up to four optional arguments:
807
808 1. A string to be processed ($_ if the string is omitted or
809 "undef")
810
811 2. A reference to a list of subroutine references and/or qr//
812 objects and/or literal strings and/or hash references,
813 specifying the extractors to be used to split the string. If
814 this argument is omitted (or "undef") the list:
815
816 [
817 sub { extract_variable($_[0], '') },
818 sub { extract_quotelike($_[0],'') },
819 sub { extract_codeblock($_[0],'{}','') },
820 ]
821
822 is used.
823
3. A number specifying the maximum number of fields to return. If
825 this argument is omitted (or "undef"), split continues as long
826 as possible.
827
828 If the third argument is N, then extraction continues until N
829 fields have been successfully extracted, or until the string
830 has been completely processed.
831
832 Note that in scalar and void contexts the value of this
833 argument is automatically reset to 1 (under "-w", a warning is
834 issued if the argument has to be reset).
835
836 4. A value indicating whether unmatched substrings (see below)
837 within the text should be skipped or returned as fields. If the
838 value is true, such substrings are skipped. Otherwise, they are
839 returned.
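A minimal sketch of the four arguments in use, with a single subroutine extractor that looks for round-bracketed text. The input string, the extractor, and the variable names are illustrative assumptions:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);

# One extractor: a round-bracketed substring at the current position.
my $parens = sub { extract_bracketed($_[0], '()') };

# Keep the unmatched text between extractions as fields of their own.
my $keep_text = 'a (b c) d';
my @all = extract_multiple($keep_text, [ $parens ]);

# A true fourth argument skips the unmatched text instead.
my $skip_text = 'a (b c) d';
my @only = extract_multiple($skip_text, [ $parens ], undef, 1);

print "all fields:  ", join(' | ', map { "[$_]" } @all), "\n";
print "only parens: ", join(' | ', map { "[$_]" } @only), "\n";
```

Two separate input variables are used because "extract_multiple" starts at the current "pos" of its argument and leaves "pos" at the point it reached; reusing one variable would require resetting "pos" between the calls.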
840
841 The extraction process works by applying each extractor in sequence
842 to the text string.
843
844 If the extractor is a subroutine it is called in a list context and
845 is expected to return a list of a single element, namely the
846 extracted text. It may optionally also return two further
847 arguments: a string representing the text left after extraction
848 (like $' for a pattern match), and a string representing any prefix
849 skipped before the extraction (like $` in a pattern match). Note
850 that this is designed to facilitate the use of other Text::Balanced
851 subroutines with "extract_multiple". Note too that the value
852 returned by an extractor subroutine need not bear any relationship
853 to the corresponding substring of the original text (see examples
854 below).
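
A hand-written extractor might look like the sketch below. The subroutine "upper_word" is hypothetical. The sketch assumes that a custom extractor is itself responsible for advancing the "pos" of the text (here via a "\G" match with the gc modifiers on $_[0]), as the Text::Balanced extractors do in a list context:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

# Hypothetical extractor: matches a run of lowercase letters at the
# current pos of the text, but returns it uppercased, so the field
# differs from the substring that was consumed.
sub upper_word {
    return uc $1 if $_[0] =~ /\G([a-z]+)/gc;   # advances pos on success
    return;                                    # empty list signals failure
}

my $text   = 'abc 123 def';
my @fields = extract_multiple($text, [ \&upper_word ], undef, 1);

print join('|', @fields), "\n";   # ABC|DEF
```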
855
856 If the extractor is a precompiled regular expression or a string,
857 it is matched against the text in a scalar context with a leading
858 '\G' and the gc modifiers enabled. The extracted value is either $1
859 if that variable is defined after the match, or else the complete
860 match (i.e. $&).
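
The difference between a regex extractor with a capture group and one without can be sketched as follows (the patterns and the input are made up for the example):

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

my $text   = '12px solid';
my @fields = extract_multiple($text,
    [
        qr/(\d+)px/,   # capture group: the field is $1, i.e. '12'
        qr/[a-z]+/,    # no capture group: the field is the whole match
    ],
    undef, 1);         # discard the unmatched space

print join('|', @fields), "\n";   # 12|solid
```

The first pattern consumes "12px" from the text but contributes only "12" as the field.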
861
862 If the extractor is a hash reference, it must contain exactly one
863 element. The value of that element is one of the above extractor
864 types (subroutine reference, regular expression, or string). The
865 key of that element is the name of a class into which the
866 successful return value of the extractor will be blessed.
867
If an extractor returns a defined value, that value is immediately
treated as the next extracted field and pushed onto the list of
fields. If the extractor was specified in a hash reference, the
field is also blessed into the appropriate class.
872
873 If the extractor fails to match (in the case of a regex extractor),
874 or returns an empty list or an undefined value (in the case of a
875 subroutine extractor), it is assumed to have failed to extract. If
876 none of the extractor subroutines succeeds, then one character is
877 extracted from the start of the text and the extraction subroutines
878 reapplied. Characters which are thus removed are accumulated and
879 eventually become the next field (unless the fourth argument is
880 true, in which case they are discarded).
881
882 For example, the following extracts substrings that are valid Perl
883 variables:
884
    @fields = extract_multiple($text,
                               [ sub { extract_variable($_[0]) } ],
                               undef, 1);
888
889 This example separates a text into fields which are quote
890 delimited, curly bracketed, and anything else. The delimited and
891 bracketed parts are also blessed to identify them (the "anything
892 else" is unblessed):
893
    @fields = extract_multiple($text,
                               [
                                   { Delim => sub { extract_delimited($_[0],q{'"}) } },
                                   { Brack => sub { extract_bracketed($_[0],'{}') } },
                               ]);
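
One way to tell the blessed fields apart afterwards is sketched below. It assumes that a blessed field is a reference to a scalar holding the extracted text, so it is dereferenced with $$field. The input string is hypothetical:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text   = q{a"b"{c}d};
my @fields = extract_multiple($text,
    [
        { Delim => sub { extract_delimited($_[0], q{'"}) } },
        { Brack => sub { extract_bracketed($_[0], '{}') } },
    ]);

# ref() gives the class name for a blessed field and '' for plain text.
my @report;
for my $field (@fields) {
    my $class = ref($field) || 'plain';
    my $value = ref($field) ? $$field : $field;   # assumed scalar reference
    push @report, "$class: $value";
}

print "$_\n" for @report;
```

For this input the report should list a plain "a", a Delim field holding the quoted string, a Brack field holding "{c}", and a plain "d".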
899
900 This call extracts the next single substring that is a valid Perl
901 quotelike operator (and removes it from $text):
902
    $quotelike = extract_multiple($text,
                                  [
                                      sub { extract_quotelike($_[0]) },
                                  ], undef, 1);
907
908 Finally, here is yet another way to do comma-separated value
909 parsing:
910
    @fields = extract_multiple($csv_text,
                               [
                                   sub { extract_delimited($_[0],q{'"}) },
                                   qr/([^,]+)/,
                               ],
                               undef,1);
917
918 The list in the second argument means: "Try and extract a ' or "
919 delimited string, otherwise extract anything up to a comma...".
920 The undef third argument means: "...as many times as possible...",
921 and the true value in the fourth argument means "...discarding
922 anything else that appears (i.e. the commas)".
923
924 If you wanted the commas preserved as separate fields (i.e. like
925 split does if your split pattern has capturing parentheses), you
926 would just make the last parameter undefined (or remove it).
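
The two variants can be compared side by side. This is a sketch only: the sample line is invented, and the second extractor is written as qr/[^,]+/ so that each match stops at the next comma:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

my @extractors = (
    sub { extract_delimited($_[0], q{'"}) },   # quoted field (quotes are kept)
    qr/[^,]+/,                                 # otherwise, anything up to a comma
);

# Separate copies, since each call moves the pos of its string.
my ($drop, $keep) = (q{a,"b,c",d}) x 2;

my @values    = extract_multiple($drop, \@extractors, undef, 1);  # commas discarded
my @with_seps = extract_multiple($keep, \@extractors);            # commas kept as fields

print scalar(@values), " values, ", scalar(@with_seps), " fields with separators\n";
```

For this input the first call should give the three values a, "b,c" and d, and the second should interleave them with two comma fields.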
927
928 "gen_delimited_pat"
The "gen_delimited_pat" subroutine takes a single (string) argument
and builds a Friedl-style optimized regex that matches a string
delimited by any one of the characters in the single argument. For
example:
934
    gen_delimited_pat(q{'"})
936
937 returns the regex:
938
    (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
940
941 Note that the specified delimiters are automatically quotemeta'd.
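
Because the return value is a pattern string, it can be interpolated into an ordinary match. A minimal sketch, with an invented input string:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

my $pat  = gen_delimited_pat(q{'"});
my $text = q{say "a \"b\" c" and 'd'};

# Wrap the pattern in a capture group and collect every quoted string.
my @quoted = $text =~ /($pat)/g;

print "$_\n" for @quoted;
```

This should print the double-quoted string (with its backslash-escaped inner quotes intact) followed by the single-quoted one. The sketch relies only on the generated pattern matching such strings, not on its exact text.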
942
943 A typical use of "gen_delimited_pat" would be to build special
944 purpose tags for "extract_tagged". For example, to properly ignore
945 "empty" XML elements (which might contain quoted strings):
946
    my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';

    extract_tagged($text, undef, undef, undef, {ignore => [$empty_tag]} );
950
951 "gen_delimited_pat" may also be called with an optional second
952 argument, which specifies the "escape" character(s) to be used for
953 each delimiter. For example to match a Pascal-style string (where
954 ' is the delimiter and '' is a literal ' within the string):
955
    gen_delimited_pat(q{'},q{'});
957
958 Different escape characters can be specified for different
959 delimiters. For example, to specify that '/' is the escape for
960 single quotes and '%' is the escape for double quotes:
961
    gen_delimited_pat(q{'"},q{/%});
963
964 If more delimiters than escape chars are specified, the last escape
965 char is used for the remaining delimiters. If no escape char is
966 specified for a given specified delimiter, '\' is used.
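
The escape rules can be tried out with a short sketch such as the following. The input strings are hypothetical:

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Pascal-style: ' is the delimiter and '' is a literal ' inside the string.
my $pascal = gen_delimited_pat(q{'}, q{'});
my ($str)  = q{writeln('it''s');} =~ /($pascal)/;

# '/' escapes inside single quotes, '%' escapes inside double quotes.
my $mixed = gen_delimited_pat(q{'"}, q{/%});
my @found = q{'a/'b' "c%"d"} =~ /($mixed)/g;

print "$str\n";
print "$_\n" for @found;
```

In each case the escaped delimiter stays inside the match rather than ending it.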
967
968 "delimited_pat"
969 Note that "gen_delimited_pat" was previously called
970 "delimited_pat". That name may still be used, but is now
971 deprecated.
972
974 In a list context, all the functions return "(undef,$original_text)" on
975 failure. In a scalar context, failure is indicated by returning "undef"
976 (in this case the input text is not modified in any way).
977
978 In addition, on failure in any context, the $@ variable is set.
979 Accessing "$@->{error}" returns one of the error diagnostics listed
980 below. Accessing "$@->{pos}" returns the offset into the original
string at which the error was detected (although not necessarily where
it occurred). Printing $@ directly produces the error message, with
983 the offset appended. On success, the $@ variable is guaranteed to be
984 "undef".
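
A minimal sketch of checking for failure and reading the diagnostic is shown below. The input strings are invented, and the exact offset reported is not assumed here:

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# The '[' is closed by ')' here, so extraction should fail.
my $bad = '(a [b)';
my ($extracted, $remainder) = extract_bracketed($bad, '()[]');

my ($message, $offset);
if (!defined $extracted) {
    ($message, $offset) = ($@->{error}, $@->{pos});
    print "failed at offset $offset: $message\n";
}

# A well-formed input: $@ should be undef afterwards.
my $good = '(ok) rest';
my ($ok, $rest) = extract_bracketed($good, '()');
print "extracted $ok, left '$rest'\n" unless defined $@;
```

On failure in a list context the second return value is the original text, so $remainder equals $bad in this sketch.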
985
986 The available diagnostics are:
987
988 "Did not find a suitable bracket: "%s""
989 The delimiter provided to "extract_bracketed" was not one of
990 '()[]<>{}'.
991
992 "Did not find prefix: /%s/"
993 A non-optional prefix was specified but wasn't found at the start
994 of the text.
995
996 "Did not find opening bracket after prefix: "%s""
997 "extract_bracketed" or "extract_codeblock" was expecting a
998 particular kind of bracket at the start of the text, and didn't
999 find it.
1000
1001 "No quotelike operator found after prefix: "%s""
1002 "extract_quotelike" didn't find one of the quotelike operators "q",
1003 "qq", "qw", "qx", "s", "tr" or "y" at the start of the substring it
1004 was extracting.
1005
1006 "Unmatched closing bracket: "%c""
1007 "extract_bracketed", "extract_quotelike" or "extract_codeblock"
1008 encountered a closing bracket where none was expected.
1009
1010 "Unmatched opening bracket(s): "%s""
1011 "extract_bracketed", "extract_quotelike" or "extract_codeblock" ran
1012 out of characters in the text before closing one or more levels of
1013 nested brackets.
1014
1015 "Unmatched embedded quote (%s)"
1016 "extract_bracketed" attempted to match an embedded quoted
1017 substring, but failed to find a closing quote to match it.
1018
1019 "Did not find closing delimiter to match '%s'"
1020 "extract_quotelike" was unable to find a closing delimiter to match
1021 the one that opened the quote-like operation.
1022
1023 "Mismatched closing bracket: expected "%c" but found "%s""
1024 "extract_bracketed", "extract_quotelike" or "extract_codeblock"
1025 found a valid bracket delimiter, but it was the wrong species. This
1026 usually indicates a nesting error, but may indicate incorrect
1027 quoting or escaping.
1028
1029 "No block delimiter found after quotelike "%s""
1030 "extract_quotelike" or "extract_codeblock" found one of the
1031 quotelike operators "q", "qq", "qw", "qx", "s", "tr" or "y" without
1032 a suitable block after it.
1033
1034 "Did not find leading dereferencer"
1035 "extract_variable" was expecting one of '$', '@', or '%' at the
1036 start of a variable, but didn't find any of them.
1037
1038 "Bad identifier after dereferencer"
1039 "extract_variable" found a '$', '@', or '%' indicating a variable,
1040 but that character was not followed by a legal Perl identifier.
1041
1042 "Did not find expected opening bracket at %s"
1043 "extract_codeblock" failed to find any of the outermost opening
1044 brackets that were specified.
1045
1046 "Improperly nested codeblock at %s"
1047 A nested code block was found that started with a delimiter that
1048 was specified as being only to be used as an outermost bracket.
1049
1050 "Missing second block for quotelike "%s""
1051 "extract_codeblock" or "extract_quotelike" found one of the
1052 quotelike operators "s", "tr" or "y" followed by only one block.
1053
1054 "No match found for opening bracket"
1055 "extract_codeblock" failed to find a closing bracket to match the
1056 outermost opening bracket.
1057
1058 "Did not find opening tag: /%s/"
1059 "extract_tagged" did not find a suitable opening tag (after any
1060 specified prefix was removed).
1061
1062 "Unable to construct closing tag to match: /%s/"
1063 "extract_tagged" matched the specified opening tag and tried to
1064 modify the matched text to produce a matching closing tag (because
1065 none was specified). It failed to generate the closing tag, almost
1066 certainly because the opening tag did not start with a bracket of
1067 some kind.
1068
1069 "Found invalid nested tag: %s"
1070 "extract_tagged" found a nested tag that appeared in the "reject"
1071 list (and the failure mode was not "MAX" or "PARA").
1072
1073 "Found unbalanced nested tag: %s"
1074 "extract_tagged" found a nested opening tag that was not matched by
1075 a corresponding nested closing tag (and the failure mode was not
1076 "MAX" or "PARA").
1077
1078 "Did not find closing tag"
1079 "extract_tagged" reached the end of the text without finding a
1080 closing tag to match the original opening tag (and the failure mode
1081 was not "MAX" or "PARA").
1082
1084 The following symbols are, or can be, exported by this module:
1085
1086 Default Exports
1087 None.
1088
1089 Optional Exports
1090 "extract_delimited", "extract_bracketed", "extract_quotelike",
1091 "extract_codeblock", "extract_variable", "extract_tagged",
1092 "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",
1093 "delimited_pat".
1094
1095 Export Tags
1096 ":ALL"
1097 "extract_delimited", "extract_bracketed", "extract_quotelike",
1098 "extract_codeblock", "extract_variable", "extract_tagged",
1099 "extract_multiple", "gen_delimited_pat", "gen_extract_tagged",
1100 "delimited_pat".
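
Since nothing is exported by default, a caller either names the functions it wants or asks for the ":ALL" tag. A short sketch (the sample text is invented):

```perl
use strict;
use warnings;

# Import every exportable function at once ...
use Text::Balanced qw(:ALL);

# ... or name only the ones needed, for example:
# use Text::Balanced qw(extract_bracketed extract_quotelike);

my $text = '[x] y';
my ($inner, $rest) = extract_bracketed($text, '[]');

print "$inner / $rest\n";
```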
1101
For known bugs, see
<https://rt.cpan.org/Dist/Display.html?Status=Active&Queue=Text-Balanced>.
1105
1107 Patches, bug reports, suggestions or any other feedback is welcome.
1108
1109 Patches can be sent as GitHub pull requests at
1110 <https://github.com/steve-m-hay/Text-Balanced/pulls>.
1111
1112 Bug reports and suggestions can be made on the CPAN Request Tracker at
1113 <https://rt.cpan.org/Public/Bug/Report.html?Queue=Text-Balanced>.
1114
1115 Currently active requests on the CPAN Request Tracker can be viewed at
1116 <https://rt.cpan.org/Public/Dist/Display.html?Status=Active;Queue=Text-Balanced>.
1117
1118 Please test this distribution. See CPAN Testers Reports at
1119 <https://www.cpantesters.org/> for details of how to get involved.
1120
1121 Previous test results on CPAN Testers Reports can be viewed at
1122 <https://www.cpantesters.org/distro/T/Text-Balanced.html>.
1123
1124 Please rate this distribution on CPAN Ratings at
1125 <https://cpanratings.perl.org/rate/?distribution=Text-Balanced>.
1126
1128 The latest version of this module is available from CPAN (see "CPAN" in
1129 perlmodlib for details) at
1130
1131 <https://metacpan.org/release/Text-Balanced> or
1132
1133 <https://www.cpan.org/authors/id/S/SH/SHAY/> or
1134
1135 <https://www.cpan.org/modules/by-module/Text/>.
1136
1137 The latest source code is available from GitHub at
1138 <https://github.com/steve-m-hay/Text-Balanced>.
1139
For installation instructions, see the INSTALL file.
1142
Damian Conway <damian@conway.org>.

Steve Hay <shay@cpan.org> is now maintaining
Text::Balanced as of version 2.03.
1148
1150 Copyright (C) 1997-2001 Damian Conway. All rights reserved.
1151
1152 Copyright (C) 2009 Adam Kennedy.
1153
1154 Copyright (C) 2015, 2020 Steve Hay. All rights reserved.
1155
1157 This module is free software; you can redistribute it and/or modify it
1158 under the same terms as Perl itself, i.e. under the terms of either the
1159 GNU General Public License or the Artistic License, as specified in the
1160 LICENCE file.
1161
1163 Version 2.04
1164
1166 11 Dec 2020
1167
For the revision history, see the Changes file.
1170
1171
1172
1173perl v5.32.1 2021-01-27 Text::Balanced(3)