Twig(3) User Contributed Perl Documentation Twig(3)

6 XML::Twig - A perl module for processing huge XML documents in tree
7 mode.
8
10 Note that this documentation is intended as a reference to the module.
11
12 Complete docs, including a tutorial, examples, an easier to use HTML
13 version, a quick reference card and a FAQ are available at
14 <http://www.xmltwig.org/xmltwig>
15
16 Small documents (loaded in memory as a tree):
17
18 my $twig=XML::Twig->new(); # create the twig
19 $twig->parsefile( 'doc.xml'); # build it
20 my_process( $twig); # use twig methods to process it
21 $twig->print; # output the twig
22
23 Huge documents (processed in combined stream/tree mode):
24
25 # at most one div will be loaded in memory
26 my $twig=XML::Twig->new(
27 twig_handlers =>
28 { title => sub { $_->set_tag( 'h2') }, # change title tags to h2
29 # $_ is the current element
30 para => sub { $_->set_tag( 'p') }, # change para to p
31 hidden => sub { $_->delete; }, # remove hidden elements
32 list => \&my_list_process, # process list elements
33 div => sub { $_[0]->flush; }, # output and free memory
34 },
35 pretty_print => 'indented', # output will be nicely formatted
36 empty_tags => 'html', # outputs <empty_tag />
37 );
38 $twig->parsefile( 'my_big.xml');
39
40 sub my_list_process
41 { my( $twig, $list)= @_;
42 # ...
43 }
44
45 See XML::Twig 101 for other ways to use the module, as a filter for
46 example.
47
This module provides a way to process XML documents. It is built on top
of "XML::Parser".
51
52 The module offers a tree interface to the document, while allowing you
53 to output the parts of it that have been completely processed.
54
It allows minimal resource (CPU and memory) usage by building the tree
only for the parts of the documents that need actual processing,
through the use of the "twig_roots" and "twig_print_outside_roots"
options. The "finish" and "finish_print" methods also help to
improve performance.

XML::Twig tries to make simple things easy, so it tries its best to
take care of a lot of the (usually) annoying (but sometimes necessary)
features that come with XML and XML::Parser.
64
66 XML::Twig comes with a few command-line utilities:
67
68 xml_pp - xml pretty-printer
69 XML pretty printer using XML::Twig
70
71 xml_grep - grep XML files looking for specific elements
72 "xml_grep" does a grep on XML files. Instead of using regular
73 expressions it uses XPath expressions (in fact the subset of XPath
74 supported by XML::Twig).
75
76 xml_split - cut a big XML file into smaller chunks
"xml_split" takes a (presumably big) XML file and splits it into several
smaller files, based on various criteria (level in the tree, size or an
XPath expression).
80
81 xml_merge - merge back XML files split with xml_split
82 "xml_merge" takes several xml files that have been split using
83 "xml_split" and recreates a single file.
84
85 xml_spellcheck - spellcheck XML files
"xml_spellcheck" lets you spell check the content of an XML file. It
extracts the text (the content of elements and optionally of
attributes), calls a spell checker on it and then recreates the XML
document.
90
92 XML::Twig can be used either on "small" XML documents (that fit in
93 memory) or on huge ones, by processing parts of the document and
94 outputting or discarding them once they are processed.
95
96 Loading an XML document and processing it
    my $t= XML::Twig->new();
    $t->parse( '<d><title>title</title><para>p 1</para><para>p 2</para></d>');
    my $root= $t->root;
    $root->set_tag( 'html');                  # change doc to html
    my $title= $root->first_child( 'title');  # get the title
    $title->set_tag( 'h1');                   # turn it into h1
    my @para= $root->children( 'para');       # get the para children
    foreach my $para (@para)
      { $para->set_tag( 'p'); }               # turn them into p
    $t->print;                                # output the document
107
108 Other useful methods include:
109
att: "$elt->att( 'foo')" returns the "foo" attribute of an
element,

set_att: "$elt->set_att( foo => "bar")" sets the "foo" attribute to
the "bar" value,

next_sibling: "$elt->next_sibling" returns the next sibling in the
document (in the example "$title->next_sibling" is the first "para");
you can also (and actually should) use "$elt->next_sibling( 'para')" to
get it.
120
121 The document can also be transformed through the use of the cut, copy,
122 paste and move methods: "$title->cut; $title->paste( after => $p);" for
123 example
124
125 And much, much more, see XML::Twig::Elt.
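
The methods listed above can be combined as in the following minimal sketch. It reuses the small document from the previous example; the attribute name "foo" and the variable names are only illustrative.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new();
$t->parse('<d><title>title</title><para>p 1</para><para>p 2</para></d>');

my $root  = $t->root;
my $title = $root->first_child('title');

# set an attribute, then read it back through the accessor method
$title->set_att(foo => 'bar');
my $foo = $title->att('foo');

# next_sibling with a tag argument only returns a sibling with that tag
my $first_para = $title->next_sibling('para');

# move the title so that it comes after the first para
$title->cut;
$title->paste(after => $first_para);

my $result = $t->sprint;
print "$result\n";
```

Using "att" and "set_att" rather than reaching into the element hash keeps the code independent of how elements are stored internally.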
126
127 Processing an XML document chunk by chunk
One of the strengths of XML::Twig is that it lets you work with files
that do not fit in memory (BTW storing an XML document in memory as a
tree is quite memory-expensive, the expansion factor being often around
10).
132
133 To do this you can define handlers, that will be called once a specific
134 element has been completely parsed. In these handlers you can access
135 the element and process it as you see fit, using the navigation and the
136 cut-n-paste methods, plus lots of convenient ones like "prefix ". Once
137 the element is completely processed you can then "flush " it, which
138 will output it and free the memory. You can also "purge " it if you
139 don't need to output it (if you are just extracting some data from the
140 document for example). The handler will be called again once the next
141 relevant element has been parsed.
142
    my $t= XML::Twig->new( twig_handlers =>
                             { section => \&section,
                               para    => sub { $_->set_tag( 'p'); }
                             },
                         );
    $t->parsefile( 'doc.xml');

    # the handler is called once a section is completely parsed, i.e. when
    # the end tag for section is found; it receives the twig itself and
    # the element (including all its sub-elements) as arguments
    sub section
      { my( $t, $section)= @_;      # arguments for all twig_handlers
        $section->set_tag( 'div');  # change the tag name
        # let's use the attribute nb as a prefix to the title
        my $title= $section->first_child( 'title'); # find the title
        my $nb= $title->att( 'nb');                 # get the attribute
        $title->prefix( "$nb - ");                  # easy isn't it?
        $section->flush;            # outputs the section and frees memory
      }
162
163 There is of course more to it: you can trigger handlers on more
164 elaborate conditions than just the name of the element, "section/title"
165 for example.
166
167 my $t= XML::Twig->new( twig_handlers =>
168 { 'section/title' => sub { $_->print } }
169 )
170 ->parsefile( 'doc.xml');
171
172 Here "sub { $_->print }" simply prints the current element ($_ is
173 aliased to the element in the handler).
174
175 You can also trigger a handler on a test on an attribute:
176
    my $t= XML::Twig->new( twig_handlers =>
                             { 'section[@level="1"]' => sub { $_->print } }
                         )
                    ->parsefile( 'doc.xml');
181
You can also use "start_tag_handlers" to process an element as soon as
the start tag is found. Besides "prefix" you can also use "suffix".
184
185 Processing just parts of an XML document
The twig_roots mode builds only the required sub-trees from the
document. Anything outside of the twig roots will just be ignored:
188
189 my $t= XML::Twig->new(
190 # the twig will include just the root and selected titles
191 twig_roots => { 'section/title' => \&print_n_purge,
192 'annex/title' => \&print_n_purge
193 }
194 );
195 $t->parsefile( 'doc.xml');
196
197 sub print_n_purge
198 { my( $t, $elt)= @_;
199 print $elt->text; # print the text (including sub-element texts)
200 $t->purge; # frees the memory
201 }
202
You can use that mode when you want to process parts of a document but
are not interested in the rest and you don't want to pay the price,
either in time or memory, to build the tree for it.
206
207 Building an XML filter
208 You can combine the "twig_roots" and the "twig_print_outside_roots"
209 options to build filters, which let you modify selected elements and
210 will output the rest of the document as is.
211
212 This would convert prices in $ to prices in Euro in a document:
213
214 my $t= XML::Twig->new(
215 twig_roots => { 'price' => \&convert, }, # process prices
216 twig_print_outside_roots => 1, # print the rest
217 );
218 $t->parsefile( 'doc.xml');
219
    sub convert
      { my( $t, $price)= @_;
        my $currency= $price->att( 'currency');         # get the currency
        if( $currency eq 'USD')
          { my $usd_price= $price->text;                # get the price
            # %rate is just a conversion table
            my $euro_price= $usd_price * $rate{usd2euro};
            $price->set_text( $euro_price);             # set the new price
            $price->set_att( currency => 'EUR');        # don't forget this!
          }
        $price->print;                                  # output the price
      }
232
233 XML::Twig and various versions of Perl, XML::Parser and expat:
234 XML::Twig is a lot more sensitive to variations in versions of perl,
235 XML::Parser and expat than to the OS, so this should cover some
236 reasonable configurations.
237
238 The "recommended configuration" is perl 5.8.3+ (for good Unicode
239 support), XML::Parser 2.31+ and expat 1.95.5+
240
241 See <http://testers.cpan.org/search?request=dist&dist=XML-Twig> for the
242 CPAN testers reports on XML::Twig, which list all tested
243 configurations.
244
245 An Atom feed of the CPAN Testers results is available at
246 <http://xmltwig.org/rss/twig_testers.rss>
247
248 Finally:
249
250 XML::Twig does NOT work with expat 1.95.4
251 XML::Twig only works with XML::Parser 2.27 in perl 5.6.*
252 Note that I can't compile XML::Parser 2.27 anymore, so I can't
253 guarantee that it still works
254
255 XML::Parser 2.28 does not really work
256
257 When in doubt, upgrade expat, XML::Parser and Scalar::Util
258
259 Finally, for some optional features, XML::Twig depends on some
260 additional modules. The complete list, which depends somewhat on the
261 version of Perl that you are running, is given by running
262 "t/zz_dump_config.t"
263
265 Whitespaces
Whitespaces that look non-significant are discarded. This behaviour
can be controlled using the "keep_spaces", "keep_spaces_in" and
"discard_spaces_in" options.
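
As a minimal sketch of the difference, the snippet below parses the same indented document with the default settings and with "keep_spaces" set. The sample document and variable names are made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# whitespace-only text containing line feeds sits between the tags
my $xml = "<d>\n  <a>x</a>\n</d>";

# default: whitespace that looks non-significant is dropped
my $default = XML::Twig->new();
$default->parse($xml);
my $stripped = $default->sprint;

# keep_spaces: the whitespace is kept in the twig and in the output
my $keeper = XML::Twig->new(keep_spaces => 1);
$keeper->parse($xml);
my $kept = $keeper->sprint;

print "default:     [$stripped]\n";
print "keep_spaces: [$kept]\n";
```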
269
270 Encoding
271 You can specify that you want the output in the same encoding as
272 the input (provided you have valid XML, which means you have to
273 specify the encoding either in the document or when you create the
274 Twig object) using the "keep_encoding " option
275
276 You can also use "output_encoding" to convert the internal UTF-8
277 format to the required encoding.
278
279 Comments and Processing Instructions (PI)
Comments and PIs can be hidden from the processing, but still
appear in the output (they are carried by the "real" element closest
to them).
283
284 Pretty Printing
285 XML::Twig can output the document pretty printed so it is easier to
286 read for us humans.
287
288 Surviving an untimely death
289 XML parsers are supposed to react violently when fed improper XML.
290 XML::Parser just dies.
291
292 XML::Twig provides the "safe_parse " and the "safe_parsefile "
293 methods which wrap the parse in an eval and return either the
294 parsed twig or 0 in case of failure.
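
A minimal sketch of "safe_parse", assuming one well-formed and one malformed input string (both invented for the example):

```perl
use strict;
use warnings;
use XML::Twig;

# a well-formed document: safe_parse returns the twig
my $good = XML::Twig->new()->safe_parse('<doc><p>fine</p></doc>');

# a malformed document (p is never closed): safe_parse returns a false
# value instead of dying
my $bad = XML::Twig->new()->safe_parse('<doc><p>not closed</doc>');

print 'well-formed: ', ($good ? 'parsed' : 'failed'), "\n";
print 'malformed:   ', ($bad  ? 'parsed' : 'failed'), "\n";
```

Testing the return value this way avoids wrapping every parse in an explicit eval block.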
295
296 Private attributes
297 Attributes with a name starting with # (illegal in XML) will not be
298 output, so you can safely use them to store temporary values during
299 processing. Note that you can store anything in a private
300 attribute, not just text, it's just a regular Perl variable, so a
301 reference to an object or a huge data structure is perfectly fine.
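
The following sketch stores a data structure in a private attribute and checks that it does not show up in the output. The attribute names "#cache" and "status" are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new();
$t->parse('<doc><item>a</item></doc>');
my $item = $t->root->first_child('item');

# a private attribute (name starting with #) can hold any Perl value
$item->set_att('#cache' => { seen => 1, tags => [ 'x', 'y' ] });

# a regular attribute, for comparison
$item->set_att(status => 'done');

my $cache  = $item->att('#cache');   # still the same hash reference
my $output = $t->sprint;             # only the regular attribute is output

print "$output\n";
```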
302
304 XML::Twig uses a very limited number of classes. The ones you are most
305 likely to use are "XML::Twig" of course, which represents a complete
306 XML document, including the document itself (the root of the document
307 itself is "root"), its handlers, its input or output filters... The
308 other main class is "XML::Twig::Elt", which models an XML element.
Element here has a very wide definition: it can be a regular element,
but also text, with an element "tag" of "#PCDATA" (or "#CDATA"), an
entity (tag is "#ENT"), a Processing Instruction ("#PI"), a comment
("#COMMENT").
313
314 Those are the 2 commonly used classes.
315
You might want to look at the "elt_class" option if you want to subclass
"XML::Twig::Elt".
318
319 Attributes are just attached to their parent element, they are not
320 objects per se. (Please use the provided methods "att" and "set_att" to
321 access them, if you access them as a hash, then your code becomes
322 implementation dependent and might break in the future).
323
324 Other classes that are seldom used are "XML::Twig::Entity_list" and
325 "XML::Twig::Entity".
326
327 If you use "XML::Twig::XPath" instead of "XML::Twig", elements are then
328 created as "XML::Twig::XPath::Elt"
329
331 XML::Twig
332 A twig is a subclass of XML::Parser, so all XML::Parser methods can be
333 called on a twig object, including parse and parsefile. "setHandlers"
334 on the other hand cannot be used, see "BUGS "
335
336 new This is a class method, the constructor for XML::Twig. Options are
337 passed as keyword value pairs. Recognized options are the same as
338 XML::Parser, plus some (in fact a lot!) XML::Twig specifics.
339
340 New Options:
341
342 twig_handlers
This argument consists of a hash "{ expression => \&handler }"
where expression is an XPath-like expression (+ some others).

XPath expressions are limited to using the child and descendant
axes (indeed you can't specify an axis), and predicates cannot
be nested. You can use the "string", or "string(<tag>)"
function (except in "twig_roots" triggers).
350
351 Additionally you can use regexps (/ delimited) to match
352 attribute and string values.
353
354 Examples:
355
356 foo
357 foo/bar
358 foo//bar
359 /foo/bar
360 /foo//bar
361 /foo/bar[@att1 = "val1" and @att2 = "val2"]/baz[@a >= 1]
362 foo[string()=~ /^duh!+/]
363 /foo[string(bar)=~ /\d+/]/baz[@att != 3]
364
365 #CDATA can be used to call a handler for a CDATA section.
366 #COMMENT can be used to call a handler for comments
367
368 Some additional (non-XPath) expressions are also provided for
369 convenience:
370
371 processing instructions
'?' or '#PI' triggers the handler for any processing
instruction, and '?<target>' or '#PI <target>' triggers a
handler for processing instructions with the given target
(e.g. '#PI xml-stylesheet').
376
377 level(<level>)
378 Triggers the handler on any element at that level in the
379 tree (root is level 1)
380
381 _all_
382 Triggers the handler for all elements in the tree
383
384 _default_
385 Triggers the handler for each element that does NOT have
386 any other handler.
387
388 Expressions are evaluated against the input document. Which
389 means that even if you have changed the tag of an element
390 (changing the tag of a parent element from a handler for
391 example) the change will not impact the expression evaluation.
There is an exception to this: "private" attributes (whose names
start with a '#', and can only be created during the parsing,
as they are not valid XML) are checked against the current
twig.
396
397 Handlers are triggered in fixed order, sorted by their type
398 (xpath expressions first, then regexps, then level), then by
399 whether they specify a full path (starting at the root element)
400 or not, then by number of steps in the expression, then number
401 of predicates, then number of tests in predicates. Handlers
402 where the last step does not specify a step ("foo/bar/*") are
403 triggered after other XPath handlers. Finally "_all_" handlers
404 are triggered last.
405
406 Important: once a handler has been triggered if it returns 0
407 then no other handler is called, except a "_all_" handler which
408 will be called anyway.
409
If a handler returns a true value and other handlers apply,
then the next applicable handler will be called, and so on
until no applicable handler is left. The exception to that rule
is when the "do_not_chain_handlers" option is set, in which case
only the first handler will be called.
415
416 Note that it might be a good idea to explicitly return a short
417 true value (like 1) from handlers: this ensures that other
418 applicable handlers are called even if the last statement for
419 the handler happens to evaluate to false. This might also
420 speedup the code by avoiding the result of the last statement
421 of the code to be copied and passed to the code managing
422 handlers. It can really pay to have 1 instead of a long string
423 returned.
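
The chaining rules above can be checked with a small sketch like the one below. It registers two handlers that both apply to the same "title" element and counts the calls, with and without "do_not_chain_handlers". The helper "count_calls" and the sample document are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<doc><section><title>t</title></section></doc>';

# parse the sample document with two handlers that both match the title,
# and return how many handler calls were made
sub count_calls {
    my (%options) = @_;
    my $calls = 0;
    my $t = XML::Twig->new(
        twig_handlers => {
            'section/title' => sub { $calls++; return 1 },  # explicit true value
            'title'         => sub { $calls++; return 1 },
        },
        %options,
    );
    $t->parse($xml);
    return $calls;
}

my $chained   = count_calls();                            # both handlers run
my $unchained = count_calls(do_not_chain_handlers => 1);  # only one runs

print "chained: $chained, not chained: $unchained\n";
```

Returning 1 explicitly from each handler makes the chaining independent of whatever the last statement in the handler happens to evaluate to.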
424
When the closing tag for an element is parsed the corresponding
handler is called, with 2 arguments: the twig and the "Element".
The twig includes the document tree that has been built so
far, the element is the complete sub-tree for the element. The
429 fact that the handler is called only when the closing tag for
430 the element is found means that handlers for inner elements are
431 called before handlers for outer elements.
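
A minimal sketch of that calling order, using invented tag names "outer" and "inner":

```perl
use strict;
use warnings;
use XML::Twig;

my @order;
my $t = XML::Twig->new(
    twig_handlers => {
        # each handler records the tag of the element it was called for
        outer => sub { push @order, $_->tag; return 1 },
        inner => sub { push @order, $_->tag; return 1 },
    },
);
$t->parse('<doc><outer><inner>text</inner></outer></doc>');

# the closing tag of inner is seen before the closing tag of outer
my $sequence = join ',', @order;
print "$sequence\n";
```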
432
433 $_ is also set to the element, so it is easy to write inline
434 handlers like
435
436 para => sub { $_->set_tag( 'p'); }
437
438 Text is stored in elements whose tag name is #PCDATA (due to
439 mixed content, text and sub-element in an element there is no
440 way to store the text as just an attribute of the enclosing
441 element, this is similar to the DOM model).
442
443 Warning: if you have used purge or flush on the twig the
444 element might not be complete, some of its children might have
445 been entirely flushed or purged, and the start tag might even
446 have been printed (by "flush") already, so changing its tag
447 might not give the expected result.
448
449 twig_roots
This argument lets you build the tree only for those elements
you are interested in.

Example:

    my $t= XML::Twig->new( twig_roots => { title => 1, subtitle => 1});
    $t->parsefile( 'doc.xml');

    my $t= XML::Twig->new( twig_roots => { 'section/title' => 1});
    $t->parsefile( 'doc.xml');

The first example returns a twig containing a document that includes
only the "title" and "subtitle" elements, as children of the root element.
460
You can use generic_attribute_condition, attribute_condition,
full_path, partial_path, tag, tag_regexp, _default_ and _all_
to trigger the building of the twig. string_condition and
regexp_condition cannot be used, as the content of the element,
and therefore the string, has not yet been parsed when the
condition is checked.
467
WARNING: paths are checked against the document. Even if the
"twig_roots" option is used they will be checked against the
full document tree, not the virtual tree created by XML::Twig.
471
472 WARNING: twig_roots elements should NOT be nested, that would
473 hopelessly confuse XML::Twig ;--(
474
475 Note: you can set handlers (twig_handlers) using twig_roots
476 Example: my $t= XML::Twig->new( twig_roots =>
477 { title => sub {
478 $_[1]->print;},
479 subtitle =>
480 \&process_subtitle
481 }
482 );
483 $t->parsefile( file);
484
485 twig_print_outside_roots
486 To be used in conjunction with the "twig_roots" argument. When
487 set to a true value this will print the document outside of the
488 "twig_roots" elements.
489
Example:

    my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                           twig_print_outside_roots => 1,
                         );
    $t->parsefile( 'doc.xml');

    { my $nb;
      sub number_title
        { my( $twig, $title)= @_;
          $nb++;
          $title->prefix( "$nb ");
          $title->print;
        }
    }
502
503 This example prints the document outside of the title element,
504 calls "number_title" for each "title" element, prints it, and
505 then resumes printing the document. The twig is built only for
506 the "title" elements.
507
508 If the value is a reference to a file handle then the document
509 outside the "twig_roots" elements will be output to this file
510 handle:
511
    open( my $out, '>', 'out_file.xml') or die "cannot open out_file.xml: $!";
    my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                           # default output to $out
                           twig_print_outside_roots => $out,
                         );

    { my $nb;
      sub number_title
        { my( $twig, $title)= @_;
          $nb++;
          $title->prefix( "$nb ");
          $title->print( $out);    # you have to print to $out here
        }
    }
526
527 start_tag_handlers
A hash "{ expression => \&handler }". Sets element handlers that
are called when the element is opened (at the end of the
XML::Parser "Start" handler). The handlers are called with 2
params: the twig and the element. The element is empty at that
point, but its attributes have already been created.
533
534 You can use generic_attribute_condition, attribute_condition,
535 full_path, partial_path, tag, tag_regexp, _default_ and _all_
536 to trigger the handler.
537
string_condition and regexp_condition cannot be used, as the
content of the element, and therefore the string, has not yet
been parsed when the condition is checked.
541
The main uses for those handlers are to change the tag name
(you might have to do it as soon as you find the open tag if
you plan to "flush" the twig at some point in the element), and
to create temporary attributes that will be used when
processing sub-elements with "twig_handlers".
547
548 Note: "start_tag" handlers can be called outside of
549 "twig_roots" if this argument is used. Since the element object
550 is not built, in this case handlers are called with the
551 following arguments: $t (the twig), $tag (the tag of the
552 element) and %att (a hash of the attributes of the element).
553
554 If the "twig_print_outside_roots" argument is also used, if the
555 last handler called returns a "true" value, then the start tag
556 will be output as it appeared in the original document, if the
557 handler returns a "false" value then the start tag will not be
558 printed (so you can print a modified string yourself for
559 example).
560
561 Note that you can use the ignore method in "start_tag_handlers"
562 (and only there).
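
As a rough sketch of what a start tag handler sees, the code below records, for each "item" element, its "id" attribute and whether it already has a child at the time the handler is called. The tag and attribute names are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;
my $t = XML::Twig->new(
    start_tag_handlers => {
        item => sub {
            my ($twig, $elt) = @_;
            push @seen, {
                id        => $elt->att('id'),                # attributes are available
                has_child => ($elt->first_child ? 1 : 0),    # content is not parsed yet
            };
            return 1;
        },
    },
);
$t->parse('<doc><item id="a">one</item><item id="b">two</item></doc>');

print "$_->{id}: has_child=$_->{has_child}\n" for @seen;
```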
563
564 end_tag_handlers
A hash "{ expression => \&handler }". Sets element handlers that
are called when the element is closed (at the end of the
XML::Parser "End" handler). The handlers are called with 2
params: the twig and the tag of the element.
569
570 twig_handlers are called when an element is completely parsed,
571 so why have this redundant option? There is only one use for
572 "end_tag_handlers": when using the "twig_roots" option, to
573 trigger a handler for an element outside the roots. It is for
574 example very useful to number titles in a document using nested
575 sections:
576
577 my @no= (0);
578 my $no;
579 my $t= XML::Twig->new(
580 start_tag_handlers =>
581 { section => sub { $no[$#no]++; $no= join '.', @no; push @no, 0; } },
582 twig_roots =>
583 { title => sub { $_->prefix( $no); $_->print; } },
584 end_tag_handlers => { section => sub { pop @no; } },
585 twig_print_outside_roots => 1
586 );
587 $t->parsefile( $file);
588
589 Using the "end_tag_handlers" argument without "twig_roots" will
590 result in an error.
591
592 do_not_chain_handlers
593 If this option is set to a true value, then only one handler
594 will be called for each element, even if several satisfy the
595 condition
596
597 Note that the "_all_" handler will still be called regardless
598
599 ignore_elts
600 This option lets you ignore elements when building the twig.
601 This is useful in cases where you cannot use "twig_roots" to
602 ignore elements, for example if the element to ignore is a
603 sibling of elements you are interested in.
604
605 Example:
606
607 my $twig= XML::Twig->new( ignore_elts => { elt => 'discard' });
608 $twig->parsefile( 'doc.xml');
609
610 This will build the complete twig for the document, except that
611 all "elt" elements (and their children) will be left out.
612
613 The keys in the hash are triggers, limited to the same subset
614 as "start_tag_handlers". The values can be "discard", to
615 discard the element, "print", to output the element as-is,
616 "string" to store the text of the ignored element(s), including
617 markup, in a field of the twig: "$t->{twig_buffered_string}" or
618 a reference to a scalar, in which case the text of the ignored
619 element(s), including markup, will be stored in the scalar. Any
620 other value will be treated as "discard".
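
A minimal sketch of the "discard" behaviour, using a made-up document in which the ignored "secret" element is a sibling of the elements to keep:

```perl
use strict;
use warnings;
use XML::Twig;

# secret elements are left out of the twig altogether
my $t = XML::Twig->new(ignore_elts => { secret => 'discard' });
$t->parse('<doc><keep>a</keep><secret>b</secret><keep>c</keep></doc>');

my $result = $t->sprint;
print "$result\n";
```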
621
622 char_handler
623 A reference to a subroutine that will be called every time
624 "PCDATA" is found.
625
626 The subroutine receives the string as argument, and returns the
627 modified string:
628
629 # WE WANT ALL STRINGS IN UPPER CASE
630 sub my_char_handler
631 { my( $text)= @_;
632 $text= uc( $text);
633 return $text;
634 }
635
636 elt_class
The name of a class used to store elements. This class should
inherit from "XML::Twig::Elt" (and by default it is
"XML::Twig::Elt"). This option is used to subclass the element
class and extend it with new methods.
641
642 This option is needed because during the parsing of the XML,
643 elements are created by "XML::Twig", without any control from
644 the user code.
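
A minimal sketch of subclassing the element class. "My::Elt" and its "shout" method are hypothetical names introduced only for this illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical element subclass adding one convenience method
{
    package My::Elt;
    our @ISA = ('XML::Twig::Elt');

    # return the text of the element in upper case
    sub shout {
        my ($self) = @_;
        return uc $self->text;
    }
}

# elements created during the parse are blessed into My::Elt
my $t = XML::Twig->new(elt_class => 'My::Elt');
$t->parse('<doc><p>hello</p></doc>');

my $p     = $t->root->first_child('p');
my $class = ref $p;
my $loud  = $p->shout;

print "$class: $loud\n";
```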
645
646 keep_atts_order
647 Setting this option to a true value causes the attribute hash
648 to be tied to a "Tie::IxHash" object. This means that
649 "Tie::IxHash" needs to be installed for this option to be
650 available. It also means that the hash keeps its order, so you
651 will get the attributes in order. This allows outputting the
652 attributes in the same order as they were in the original
653 document.
654
655 keep_encoding
This is a (slightly?) evil option: if the XML document is not
UTF-8 encoded and you want to keep it that way, then setting
keep_encoding will use the "Expat" original_string method for
characters, thus keeping the original encoding, as well as the
original entities in the strings.
661
662 See the "t/test6.t" test file to see what results you can
663 expect from the various encoding options.
664
665 WARNING: if the original encoding is multi-byte then attribute
666 parsing will be EXTREMELY unsafe under any Perl before 5.6, as
667 it uses regular expressions which do not deal properly with
668 multi-byte characters. You can specify an alternate function to
669 parse the start tags with the "parse_start_tag" option (see
670 below)
671
672 WARNING: this option is NOT used when parsing with XML::Parser
673 non-blocking parser ("parse_start", "parse_more", "parse_done"
674 methods) which you probably should not use with XML::Twig
675 anyway as they are totally untested!
676
677 output_encoding
This option generates an output_filter using "Encode",
"Text::Iconv" or "Unicode::Map8" and "Unicode::String", and
sets the encoding in the XML declaration. This is the easiest
way to deal with encodings. If you need more sophisticated
features, look at "output_filter" below.
683
684 output_filter
685 This option is used to convert the character encoding of the
686 output document. It is passed either a string corresponding to
687 a predefined filter or a subroutine reference. The filter will
688 be called every time a document or element is processed by the
689 "print" functions ("print", "sprint", "flush").
690
691 Pre-defined filters:
692
693 latin1
694 uses either "Encode", "Text::Iconv" or "Unicode::Map8" and
695 "Unicode::String" or a regexp (which works only with
696 XML::Parser 2.27), in this order, to convert all characters
697 to ISO-8859-15 (usually latin1 is synonym to ISO-8859-1,
698 but in practice it seems that ISO-8859-15, which includes
699 the euro sign, is more useful and probably what most people
700 want).
701
702 html
703 does the same conversion as "latin1", plus encodes entities
704 using "HTML::Entities" (oddly enough you will need to have
705 HTML::Entities installed for it to be available). This
706 should only be used if the tags and attribute names
707 themselves are in US-ASCII, or they will be converted and
708 the output will not be valid XML any more
709
710 safe
711 converts the output to ASCII (US) only plus character
712 entities ("&#nnn;") this should be used only if the tags
713 and attribute names themselves are in US-ASCII, or they
714 will be converted and the output will not be valid XML any
715 more
716
717 safe_hex
718 same as "safe" except that the character entities are in
719 hex ("&#xnnn;")
720
721 encode_convert ($encoding)
Returns a subref that can be used to convert utf8 strings to
$encoding. Uses "Encode".
724
725 my $conv = XML::Twig::encode_convert( 'latin1');
726 my $t = XML::Twig->new(output_filter => $conv);
727
728 iconv_convert ($encoding)
729 this function is used to create a filter subroutine that
730 will be used to convert the characters to the target
731 encoding using "Text::Iconv" (which needs to be installed,
732 look at the documentation for the module and for the
733 "iconv" library to find out which encodings are available
734 on your system, "iconv -l" should give you a list of
735 available encodings)
736
737 my $conv = XML::Twig::iconv_convert( 'latin1');
738 my $t = XML::Twig->new(output_filter => $conv);
739
740 unicode_convert ($encoding)
this function is used to create a filter subroutine that
will be used to convert the characters to the target
encoding using "Unicode::String" and "Unicode::Map8"
(which need to be installed; look at the documentation for
the modules to find out which encodings are available on
your system)
747
748 my $conv = XML::Twig::unicode_convert( 'latin1');
749 my $t = XML::Twig->new(output_filter => $conv);
750
The "text" and "att" methods do not use the filter, so their
results are always in unicode.
753
754 Those predeclared filters are based on subroutines that can be
755 used by themselves (as "XML::Twig::foo").
756
757 html_encode ($string)
758 Use "HTML::Entities" to encode a utf8 string
759
760 safe_encode ($string)
761 Use either a regexp (perl < 5.8) or "Encode" to encode non-
762 ascii characters in the string in "&#<nnnn>;" format
763
764 safe_encode_hex ($string)
765 Use either a regexp (perl < 5.8) or "Encode" to encode non-
766 ascii characters in the string in "&#x<nnnn>;" format
767
768 regexp2latin1 ($string)
769 Use a regexp to encode a utf8 string into latin 1
770 (ISO-8859-1). Does not work with Perl 5.8.0!
771
772 output_text_filter
773 same as output_filter, except it doesn't apply to the brackets
774 and quotes around attribute values. This is useful for all
775 filters that could change the tagging, basically anything that
776 does not just change the encoding of the output. "html", "safe"
777 and "safe_hex" are better used with this option.
778
779 input_filter
780 This option is similar to "output_filter" except the filter is
781 applied to the characters before they are stored in the twig,
782 at parsing time.
783
784 remove_cdata
785 Setting this option to a true value will force the twig to
786 output CDATA sections as regular (escaped) PCDATA
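
The effect can be seen by serializing the same document with and without the option, as in this minimal sketch (the sample document is made up):

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<doc><![CDATA[a < b]]></doc>';

# default: the CDATA section is output as a CDATA section
my $plain = XML::Twig->new();
$plain->parse($xml);
my $with_cdata = $plain->sprint;

# remove_cdata: the same text is output as escaped PCDATA
my $flat = XML::Twig->new(remove_cdata => 1);
$flat->parse($xml);
my $without_cdata = $flat->sprint;

print "default:      $with_cdata\n";
print "remove_cdata: $without_cdata\n";
```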
787
788 parse_start_tag
789 If you use the "keep_encoding" option then this option can be
790 used to replace the default parsing function. You should
791 provide a coderef (a reference to a subroutine) as the
792 argument, this subroutine takes the original tag (given by
793 XML::Parser::Expat "original_string()" method) and returns a
794 tag and the attributes in a hash (or in a list
795 attribute_name/attribute value).
796
797 no_xxe
prevents external entities from being parsed.
799
800 This is a security feature, in case the input XML cannot be
801 trusted. With this option set to a true value defining external
802 entities in the document will cause the parse to fail.
803
This prevents an entity like "<!ENTITY xxe PUBLIC "bar"
"/etc/passwd">" from making the password file available in the
document.
807
808 expand_external_ents
When this option is used external entities (that are defined)
are expanded when the document is output using "print"
functions such as "print", "sprint", "flush" and "xml_string".
Note that in the twig the entity will be stored as an
element with a tag "#ENT", the entity will not be expanded
there, so you might want to process the entities before
outputting it.
816
817 If an external entity is not available, then the parse will
818 fail.
819
820 A special case is when the value of this option is -1. In that
821 case a missing entity will not cause the parser to die, but its
822 "name", "sysid" and "pubid" will be stored in the twig as
823 "$twig->{twig_missing_system_entities}" (a reference to an
824 array of hashes { name => <name>, sysid => <sysid>, pubid =>
825 <pubid> }). Yes, this is a bit of a hack, but it's useful in
826 some cases.
827
828 load_DTD
829 If this argument is set to a true value, "parse" or "parsefile"
830 on the twig will load the DTD information. This information
831 can then be accessed through the twig, in a "DTD_handler" for
832 example. This will load even an external DTD.
833
834 Default and fixed values for attributes will also be filled,
835 based on the DTD.
836
837 Note that to do this the module will generate a temporary file
838 in the current directory. If this is a problem let me know and
839 I will add an option to specify an alternate directory.
840
841 See "DTD Handling" for more information
842
843 DTD_base <path_to_DTD_directory>
If the DTD is in a different directory, the module looks for it
there. This is useful to make up somewhat for the lack of
catalog support in "expat". You still need a SYSTEM declaration.
847
848 DTD_handler
849 Set a handler that will be called once the doctype (and the
850 DTD) have been loaded, with 2 arguments, the twig and the DTD.
851
852 no_prolog
853 Does not output a prolog (XML declaration and DTD)
854
855 id This optional argument gives the name of an attribute that can
856 be used as an ID in the document. Elements whose ID is known
857 can be accessed through the elt_id method. id defaults to 'id'.
858 See "BUGS "
859
860 discard_spaces
861 If this optional argument is set to a true value then spaces
862 are discarded when they look non-significant: strings
863 containing only spaces and at least one line feed are
864 discarded. This argument is set to true by default.
865
866 The exact algorithm to drop spaces is: strings including only
867 spaces (perl \s) and at least one \n right before an open or
868 close tag are dropped.
869
870 discard_all_spaces
871 If this argument is set to a true value, spaces are discarded
872 more aggressively than with "discard_spaces": strings not
873 including a \n are also dropped. This option is appropriate for
874 data-oriented XML.
875
876 keep_spaces
877 If this optional argument is set to a true value then all
878 spaces in the document are kept, and stored as "PCDATA".
879
880 Warning: adding this option can result in changes in the twig
881 generated: space that was previously discarded might end up in
882 a new text element. see the difference by calling the following
883 code with 0 and 1 as arguments:
884
885 perl -MXML::Twig -e'print XML::Twig->new( keep_spaces => shift)->parse( "<d> \n<e/></d>")->_dump'
886
887 "keep_spaces" and "discard_spaces" cannot be both set.
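
The difference between the default space handling and "keep_spaces"
can also be checked from code. This is a small sketch using the same
sample string as the one-liner above; it lists the tags of the root's
children in each case.

```perl
use strict;
use warnings;
use XML::Twig;

# Same sample as above: whitespace including a line feed, right before a tag
my $xml = "<d> \n<e/></d>";

my $default = XML::Twig->new->parse($xml);
my $keep    = XML::Twig->new( keep_spaces => 1 )->parse($xml);

# Text elements have the tag '#PCDATA'
my @default_children = map { $_->tag } $default->root->children;
my @kept_children    = map { $_->tag } $keep->root->children;

print "default:     @default_children\n";
print "keep_spaces: @kept_children\n";
```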
888
889 discard_spaces_in
890 This argument sets "keep_spaces" to true but will cause the
891 twig builder to discard spaces in the elements listed.
892
893 The syntax for using this argument is:
894
895 XML::Twig->new( discard_spaces_in => [ 'elt1', 'elt2']);
896
897 keep_spaces_in
898 This argument sets "discard_spaces" to true but will cause the
899 twig builder to keep spaces in the elements listed.
900
901 The syntax for using this argument is:
902
903 XML::Twig->new( keep_spaces_in => [ 'elt1', 'elt2']);
904
905 Warning: adding this option can result in changes in the twig
906 generated: space that was previously discarded might end up in
907 a new text element.
908
909 pretty_print
910 Set the pretty print method, amongst '"none"' (default),
911 '"nsgmls"', '"nice"', '"indented"', '"indented_c"',
912 '"indented_a"', '"indented_close_tag"', '"cvs"', '"wrapped"',
913 '"record"' and '"record_c"'
914
915 pretty_print formats:
916
917 none
The document is output as one long string, with no line
breaks except those found within text elements.
920
921 nsgmls
922 Line breaks are inserted in safe places: that is within
923 tags, between a tag and an attribute, between attributes
924 and before the > at the end of a tag.
925
926 This is quite ugly but better than "none", and it is very
927 safe, the document will still be valid (conforming to its
928 DTD).
929
930 This is how the SGML parser "sgmls" splits documents, hence
931 the name.
932
933 nice
This option inserts line breaks before any tag that does
not contain text (so elements with textual content are not
broken, as the \n is significant there).
937
938 WARNING: this option leaves the document well-formed but
939 might make it invalid (not conformant to its DTD). If you
940 have elements declared as
941
942 <!ELEMENT foo (#PCDATA|bar)>
943
944 then a "foo" element including a "bar" one will be printed
945 as
946
947 <foo>
948 <bar>bar is just pcdata</bar>
949 </foo>
950
951 This is invalid, as the parser will take the line break
952 after the "foo" tag as a sign that the element contains
953 PCDATA, it will then die when it finds the "bar" tag. This
954 may or may not be important for you, but be aware of it!
955
956 indented
957 Same as "nice" (and with the same warning) but indents
958 elements according to their level
959
960 indented_c
961 Same as "indented" but a little more compact: the closing
962 tags are on the same line as the preceding text
963
964 indented_close_tag
965 Same as "indented" except that the closing tag is also
966 indented, to line up with the tags within the element
967
indented_a
This formats XML files in a line-oriented version control
friendly way. The format is described in
<http://tinyurl.com/2kwscq> (that's an Oracle document with
an insanely long URL).

Note that to be totally conformant to the "spec", the order
of attributes should not be changed, so if they are not
already in alphabetical order you will need to use the
"keep_atts_order" option.

cvs Same as "indented_a".
980
981 wrapped
982 Same as "indented_c" but lines are wrapped using
983 Text::Wrap::wrap. The default length for lines is the
984 default for $Text::Wrap::columns, and can be changed by
985 changing that variable.
986
987 record
This is a record-oriented pretty print that displays data
in records, one field per line (which looks a LOT like
"indented").
991
992 record_c
993 Stands for record compact, one record per line
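
To compare a few of these styles side by side, a sketch along the
following lines can help. The sample document is invented for the
example, and each twig is printed right after it is created because
the style is a global setting.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative document: text-only elements nested two levels deep
my $xml = '<doc><section><title>Intro</title><p>Some text</p></section></doc>';

my %out;
for my $style (qw( indented indented_c none )) {
    # Create, parse and serialize in one go, so the style in effect
    # is the one just requested
    $out{$style} = XML::Twig->new( pretty_print => $style )->parse($xml)->sprint;
    print "--- $style ---\n$out{$style}\n";
}
```

Ending the loop on "none" leaves the global style at its default.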
994
995 empty_tags
996 Set the empty tag display style ('"normal"', '"html"' or
997 '"expand"').
998
999 "normal" outputs an empty tag '"<tag/>"', "html" adds a space
1000 '"<tag />"' for elements that can be empty in XHTML and
1001 "expand" outputs '"<tag></tag>"'
1002
1003 quote
1004 Set the quote character for attributes ('"single"' or
1005 '"double"').
1006
1007 escape_gt
By default XML::Twig does not escape the character > in its
output, as it is not mandated by the XML spec. With this option
on, > will be replaced by "&gt;".
1011
1012 comments
1013 Set the way comments are processed: '"drop"' (default),
1014 '"keep"' or '"process"'
1015
1016 Comments processing options:
1017
1018 drop
1019 drops the comments, they are not read, nor printed to the
1020 output
1021
1022 keep
1023 comments are loaded and will appear on the output, they are
1024 not accessible within the twig and will not interfere with
1025 processing though
1026
Note: comments in the middle of a text element such as

<p>text <!-- comment --> more text</p>

are kept at their original position in the text. Using
"print" methods like "print" or "sprint" will return the
comments in the text. Using "text" or "field" on the other
hand will not.
1035
1036 Any use of "set_pcdata" on the "#PCDATA" element (directly
1037 or through other methods like "set_content") will delete
1038 the comment(s).
1039
1040 process
1041 comments are loaded in the twig and will be treated as
1042 regular elements (their "tag" is "#COMMENT") this can
1043 interfere with processing if you expect
1044 "$elt->{first_child}" to be an element but find a comment
1045 there. Validation will not protect you from this as
1046 comments can happen anywhere. You can use
1047 "$elt->first_child( 'tag')" (which is a good habit anyway)
1048 to get where you want.
1049
1050 Consider using "process" if you are outputting SAX events
1051 from XML::Twig.
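
The three settings can be compared with a short sketch like the one
below. The sample paragraph is made up, and the option is always set
explicitly so the example does not depend on the default.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative input: a comment in the middle of a text element
my $xml = '<doc><p>text <!-- note --> more</p></doc>';

# keep: the comment is output, but text() does not see it
my $kept_p    = XML::Twig->new( comments => 'keep' )->parse($xml)->first_elt('p');
my $kept_xml  = $kept_p->sprint;
my $kept_text = $kept_p->text;

# process: the comment is an element of its own, with the tag '#COMMENT'
my $processed = XML::Twig->new( comments => 'process' )->parse($xml);
my @tags = map { $_->tag } $processed->first_elt('p')->children;

# drop: the comment is gone from the output
my $dropped_xml = XML::Twig->new( comments => 'drop' )->parse($xml)
                           ->first_elt('p')->sprint;

print "keep:    $kept_xml\n";
print "text:    $kept_text\n";
print "process: @tags\n";
print "drop:    $dropped_xml\n";
```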
1052
1053 pi Set the way processing instructions are processed: '"drop"',
1054 '"keep"' (default) or '"process"'
1055
Note that you can also set PI handlers in the "twig_handlers"
option:

'?' => \&handler
'?target' => \&handler_2

The handlers will be called with 2 parameters, the twig and the
PI element, if "pi" is set to "process", and with 3, the twig,
the target and the data, if "pi" is set to "keep". Of course
they will not be called if "pi" is set to "drop".

If "pi" is set to "keep" the handler should return a string
that will be used as-is as the PI text (it should look like
"<?target data?>", or '' if you want to remove the PI).
1070
1071 Only one handler will be called, "?target" or "?" if no
1072 specific handler for that target is available.
1073
1074 map_xmlns
1075 This option is passed a hashref that maps uri's to prefixes.
1076 The prefixes in the document will be replaced by the ones in
1077 the map. The mapped prefixes can (actually have to) be used to
1078 trigger handlers, navigate or query the document.
1079
1080 Here is an example:
1081
1082 my $t= XML::Twig->new( map_xmlns => {'http://www.w3.org/2000/svg' => "svg"},
1083 twig_handlers =>
1084 { 'svg:circle' => sub { $_->set_att( r => 20) } },
1085 pretty_print => 'indented',
1086 )
1087 ->parse( '<doc xmlns:gr="http://www.w3.org/2000/svg">
1088 <gr:circle cx="10" cy="90" r="10"/>
1089 </doc>'
1090 )
1091 ->print;
1092
1093 This will output:
1094
1095 <doc xmlns:svg="http://www.w3.org/2000/svg">
1096 <svg:circle cx="10" cy="90" r="20"/>
1097 </doc>
1098
1099 keep_original_prefix
1100 When used with "map_xmlns" this option will make "XML::Twig"
1101 use the original namespace prefixes when outputting a document.
1102 The mapped prefix will still be used for triggering handlers
1103 and in navigation and query methods.
1104
1105 my $t= XML::Twig->new( map_xmlns => {'http://www.w3.org/2000/svg' => "svg"},
1106 twig_handlers =>
1107 { 'svg:circle' => sub { $_->set_att( r => 20) } },
1108 keep_original_prefix => 1,
1109 pretty_print => 'indented',
1110 )
1111 ->parse( '<doc xmlns:gr="http://www.w3.org/2000/svg">
1112 <gr:circle cx="10" cy="90" r="10"/>
1113 </doc>'
1114 )
1115 ->print;
1116
1117 This will output:
1118
1119 <doc xmlns:gr="http://www.w3.org/2000/svg">
1120 <gr:circle cx="10" cy="90" r="20"/>
1121 </doc>
1122
1123 original_uri ($prefix)
1124 called within a handler, this will return the uri bound to the
1125 namespace prefix in the original document.
1126
1127 index ($arrayref or $hashref)
1128 This option creates lists of specific elements during the
1129 parsing of the XML. It takes a reference to either a list of
1130 triggering expressions or to a hash name => expression, and for
1131 each one generates the list of elements that match the
1132 expression. The list can be accessed through the "index"
1133 method.
1134
1135 example:
1136
1137 # using an array ref
1138 my $t= XML::Twig->new( index => [ 'div', 'table' ])
1139 ->parsefile( "foo.xml");
1140 my $divs= $t->index( 'div');
1141 my $first_div= $divs->[0];
1142 my $last_table= $t->index( table => -1);
1143
# using a hashref to name the indexes
my $t2= XML::Twig->new( index => { email => 'a[@href=~/^\s*mailto:/]'})
->parsefile( "foo.xml");
my $last_email= $t2->index( email => -1);
1148
1149 Note that the index is not maintained after the parsing. If
1150 elements are deleted, renamed or otherwise hurt during
1151 processing, the index is NOT updated. (changing the id element
1152 OTOH will update the index)
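
Here is a self-contained variant of the array form that parses a
string instead of a file, so it can be run as is. The sample document
and its ids are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative document with two div and two table elements
my $xml = '<doc><div id="d1"/><table id="t1"/><div id="d2"/><table id="t2"/></doc>';

my $twig = XML::Twig->new( index => [ 'div', 'table' ] )->parse($xml);

my $divs       = $twig->index('div');            # arrayref of all indexed div's
my $first_div  = $divs->[0];
my $last_table = $twig->index( table => -1 );    # a single element

printf "%d divs, the first one is %s\n", scalar @$divs, $first_div->att('id');
printf "the last table is %s\n", $last_table->att('id');
```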
1153
1154 att_accessors <list of attribute names>
creates methods that give direct access to attributes:

my $t= XML::Twig->new( att_accessors => [ 'href', 'src'])
->parsefile( $file);
my $first_src= $t->first_elt( 'img')->src; # same as ->att( 'src')
$t->first_elt( 'img')->src( 'new_logo.png'); # changes the attribute value
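
A runnable version of the same idea, parsing a string rather than a
file. The document and the file names in it are placeholders.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new( att_accessors => [ 'src', 'alt' ] )
                    ->parse('<doc><img src="logo.png" alt="Logo"/></doc>');
my $img = $twig->first_elt('img');

my $before = $img->src;         # read, same as $img->att('src')
$img->src('new_logo.png');      # write, changes the attribute value
my $after = $img->att('src');

print "$before -> $after\n";
```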
1161
1162 elt_accessors
1163 creates methods that give direct access to the first child
1164 element (in scalar context) or the list of elements (in list
1165 context):
1166
The list of accessors to create can be given in 2 different
ways: in an array, or in a hash alias => expression.

my $t= XML::Twig->new( elt_accessors => [ 'head'])
->parsefile( $file);
my $title_text= $t->root->head->field( 'title');
# same as $title_text= $t->root->first_child( 'head')->field( 'title');
1174
1175 my $t= XML::Twig->new( elt_accessors => { warnings => 'p[@class="warning"]', d2 => 'div[2]'}, )
1176 ->parsefile( $file);
1177 my $body= $t->first_elt( 'body');
1178 my @warnings= $body->warnings; # same as $body->children( 'p[@class="warning"]');
1179 my $s2= $body->d2; # same as $body->first_child( 'div[2]')
1180
1181 field_accessors
1182 creates methods that give direct access to the first child
1183 element text:
1184
my $t= XML::Twig->new( field_accessors => [ 'title'])
->parsefile( $file);
my $div_title_text= $t->first_elt( 'div')->title;
# same as $div_title_text= $t->first_elt( 'div')->field( 'title');
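
As a quick check of how a field accessor behaves, the sketch below
creates one for "title" elements and compares it with the equivalent
"field" call. The sample document is illustrative.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new( field_accessors => [ 'title' ] )
                    ->parse('<doc><div><title>Intro</title><p>Body</p></div></doc>');
my $div = $twig->first_elt('div');

my $via_accessor = $div->title;             # text of the first title child
my $via_field    = $div->field('title');    # the long way

print "$via_accessor\n";
```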
1189
1190 use_tidy
Set this option to use HTML::Tidy instead of HTML::TreeBuilder
to convert HTML to XML. HTML found in the wild, especially
real (real "crap") HTML, is messy, so depending on the data one
module or the other does a better job at the conversion. Also,
HTML::Tidy can be a bit difficult to install, so XML::Twig
offers both options. TIMTOWTDI
1197
1198 output_html_doctype
1199 when using HTML::TreeBuilder to convert HTML, this option
1200 causes the DOCTYPE declaration to be output, which may be
1201 important for some legacy browsers. Without that option the
1202 DOCTYPE definition is NOT output. Also if the definition is
1203 completely wrong (ie not easily parsable), it is not output
1204 either.
1205
1206 Note: I _HATE_ the Java-like name of arguments used by most XML
1207 modules. So in pure TIMTOWTDI fashion all arguments can be written
1208 either as "UglyJavaLikeName" or as "readable_perl_name":
1209 "twig_print_outside_roots" or "TwigPrintOutsideRoots" (or even
1210 "twigPrintOutsideRoots" {shudder}). XML::Twig normalizes them
1211 before processing them.
1212
1213 parse ( $source)
1214 The $source parameter should either be a string containing the
1215 whole XML document, or it should be an open "IO::Handle" (aka a
1216 filehandle).
1217
The method dies if a parse error occurs. Otherwise it
returns the twig built by the parse. Use "safe_parse" if you want
the parsing to return even when an error occurs.
1221
1222 If this method is called as a class method ("XML::Twig->parse(
1223 $some_xml_or_html)") then an XML::Twig object is created, using the
1224 parameters except the last one (eg "XML::Twig->parse( pretty_print
1225 => 'indented', $some_xml_or_html)") and "xparse" is called on it.
1226
Note that when parsing a filehandle, the handle should NOT be opened
with an encoding (ie open it with "open( my $in, '<', $filename)").
The file will be parsed by "expat", so specifying the encoding
actually causes problems for the parser (as in: it can crash it, see
https://rt.cpan.org/Ticket/Display.html?id=78877). For parsing a
file it is actually recommended to use "parsefile" on the file
name, instead of "parse" on the open file.
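
Since "parse" dies on a parse error, a caller that wants to carry on
can trap the error with a plain "eval". The following is a minimal
sketch of that pattern; the two sample strings are made up, the first
one being deliberately not well-formed.

```perl
use strict;
use warnings;
use XML::Twig;

# Not well-formed: the p element is never closed
my $ok    = eval { XML::Twig->new->parse('<doc><p>unclosed</doc>'); 1 };
my $error = $ok ? '' : $@;
print $ok ? "parsed\n" : "parse failed: $error\n";

# Well-formed: parse returns the twig, so calls can be chained
my $text = XML::Twig->new->parse('<doc><p>fine</p></doc>')
                    ->root->first_child('p')->text;
print "$text\n";
```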
1234
1235 parsestring
1236 This is just an alias for "parse" for backwards compatibility.
1237
1238 parsefile (FILE [, OPT => OPT_VALUE [...]])
1239 Open "FILE" for reading, then call "parse" with the open handle.
1240 The file is closed no matter how "parse" returns.
1241
The method dies if a parse error occurs. Otherwise it
returns the twig built by the parse. Use "safe_parsefile" if you
want the parsing to return even when an error occurs.
1245
1246 parsefile_inplace ( $file, $optional_extension)
1247 Parse and update a file "in place". It does this by creating a temp
1248 file, selecting it as the default for print() statements (and
1249 methods), then parsing the input file. If the parsing is
1250 successful, then the temp file is moved to replace the input file.
1251
1252 If an extension is given then the original file is backed-up (the
1253 rules for the extension are the same as the rule for the -i option
1254 in perl).
1255
1256 parsefile_html_inplace ( $file, $optional_extension)
1257 Same as parsefile_inplace, except that it parses HTML instead of
1258 XML
1259
parseurl ($url, $optional_user_agent)
Gets the data from $url and parses it. The data is piped to the
parser in chunks the size of the XML::Parser::Expat buffer, so
memory consumption and hopefully speed are optimal.
1264
1265 For most (read "small") XML it is probably as efficient (and easier
1266 to debug) to just "get" the XML file and then parse it as a string.
1267
1268 use XML::Twig;
1269 use LWP::Simple;
1270 my $twig= XML::Twig->new();
1271 $twig->parse( LWP::Simple::get( $URL ));
1272
1273 or
1274
1275 use XML::Twig;
1276 my $twig= XML::Twig->nparse( $URL);
1277
If the $optional_user_agent argument is given then it is used to
fetch the data, otherwise a new user agent is created.
1280
1281 safe_parse ( SOURCE [, OPT => OPT_VALUE [...]])
1282 This method is similar to "parse" except that it wraps the parsing
1283 in an "eval" block. It returns the twig on success and 0 on failure
1284 (the twig object also contains the parsed twig). $@ contains the
1285 error message on failure.
1286
1287 Note that the parsing still stops as soon as an error is detected,
1288 there is no way to keep going after an error.
1289
1290 safe_parsefile (FILE [, OPT => OPT_VALUE [...]])
This method is similar to "parsefile" except that it wraps the
parsing in an "eval" block. It returns the twig on success and 0 on
failure (the twig object also contains the parsed twig). $@
contains the error message on failure.
1295
1296 Note that the parsing still stops as soon as an error is detected,
1297 there is no way to keep going after an error.
1298
safe_parseurl ($url, $optional_user_agent)
Same as "parseurl" except that it wraps the parsing in an "eval"
block. It returns the twig on success and 0 on failure (the twig
object also contains the parsed twig). $@ contains the error
message on failure.
1304
1305 parse_html ($string_or_fh)
1306 parse an HTML string or file handle (by converting it to XML using
1307 HTML::TreeBuilder, which needs to be available).
1308
This works nicely, but some information gets lost in the process:
newlines are removed, and (at least on the version I use), comments
get an extra CDATA section inside ( <!-- foo --> becomes <!--
<![CDATA[ foo ]]> --> ).
1313
1314 parsefile_html ($file)
1315 parse an HTML file (by converting it to XML using
1316 HTML::TreeBuilder, which needs to be available, or HTML::Tidy if
1317 the "use_tidy" option was used). The file is loaded completely in
1318 memory and converted to XML before being parsed.
1319
This method is to be used with caution though, as it doesn't know
about the file encoding. It is usually better to use "parse_html",
which gives you a chance to open the file with the proper encoding
layer.
1324
parseurl_html ($url, $optional_user_agent)
Parse a URL as HTML the same way "parse_html" does.
1327
safe_parseurl_html ($url, $optional_user_agent)
Same as "parseurl_html" except that it wraps the parsing in an
"eval" block. It returns the twig on success and 0 on failure (the
twig object also contains the parsed twig). $@ contains the error
message on failure.
1333
safe_parsefile_html ($file, $optional_user_agent)
Same as "parsefile_html" except that it wraps the parsing in an
"eval" block. It returns the twig on success and 0 on failure (the
twig object also contains the parsed twig). $@ contains the error
message on failure.
1339
1340 safe_parse_html ($string_or_fh)
Same as "parse_html" except that it wraps the parsing in an "eval"
block. It returns the twig on success and 0 on failure (the twig
object also contains the parsed twig). $@ contains the error
message on failure.
1345
1346 xparse ($thing_to_parse)
1347 parse the $thing_to_parse, whether it is a filehandle, a string, an
1348 HTML file, an HTML URL, an URL or a file.
1349
1350 Note that this is mostly a convenience method for one-off scripts.
1351 For example files that end in '.htm' or '.html' are parsed first as
1352 XML, and if this fails as HTML. This is certainly not the most
1353 efficient way to do this in general.
1354
1355 nparse ($optional_twig_options, $thing_to_parse)
Create a twig with the $optional_twig_options, and parse the
$thing_to_parse, whether it is a filehandle, a string, an HTML
file, an HTML URL, an URL or a file.
1359
1360 Examples:
1361
1362 XML::Twig->nparse( "file.xml");
1363 XML::Twig->nparse( error_context => 1, "file://file.xml");
1364
1365 nparse_pp ($optional_twig_options, $thing_to_parse)
1366 same as "nparse" but also sets the "pretty_print" option to
1367 "indented".
1368
1369 nparse_e ($optional_twig_options, $thing_to_parse)
1370 same as "nparse" but also sets the "error_context" option to 1.
1371
1372 nparse_ppe ($optional_twig_options, $thing_to_parse)
1373 same as "nparse" but also sets the "pretty_print" option to
1374 "indented" and the "error_context" option to 1.
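
To see how the shortcut variants relate to "nparse", this sketch
parses the same string both ways and compares the output. The sample
document is illustrative; it starts with a tag, so it is treated as a
string of XML rather than as a file name or URL.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<doc><p>hello</p></doc>';

# Options first, the thing to parse last
my $long  = XML::Twig->nparse( pretty_print => 'indented', $xml )->sprint;

# nparse_pp sets pretty_print to 'indented' itself
my $short = XML::Twig->nparse_pp($xml)->sprint;

print $long;
print "same output\n" if $long eq $short;
```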
1375
1376 parser
1377 This method returns the "expat" object (actually the
1378 XML::Parser::Expat object) used during parsing. It is useful for
1379 example to call XML::Parser::Expat methods on it. To get the line
1380 of a tag for example use "$t->parser->current_line".
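
As an example of using the parser object from a handler, the sketch
below records the line on which each "p" element ends. The sample
document is invented; handlers set through "twig_handlers" run when
the closing tag is found, so that is the line reported.

```perl
use strict;
use warnings;
use XML::Twig;

my @lines;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once each p element has been completely parsed
        p => sub { my ($t, $p) = @_; push @lines, $t->parser->current_line; },
    },
);

# One p element on line 2, another on line 3
$twig->parse("<doc>\n<p>a</p>\n<p>b</p>\n</doc>");

print "p elements found on lines: @lines\n";
```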
1381
1382 setTwigHandlers ($handlers)
1383 Set the twig_handlers. $handlers is a reference to a hash similar
1384 to the one in the "twig_handlers" option of new. All previous
1385 handlers are unset. The method returns the reference to the
1386 previous handlers.
1387
setTwigHandler ($exp, $handler)
Set a single twig_handler for elements matching $exp. $handler is a
reference to a subroutine. If the handler was previously set then
the reference to the previous handler is returned.
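
Handlers can thus be attached after the twig has been created, as
long as this is done before parsing. A minimal sketch, with a made-up
document:

```perl
use strict;
use warnings;
use XML::Twig;

my @titles;
my $twig = XML::Twig->new;

# Attach the handler after construction; $_ is the current element
$twig->setTwigHandler( title => sub { push @titles, $_->text } );

$twig->parse('<doc><title>One</title><title>Two</title></doc>');

print "titles: @titles\n";
```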
1392
1393 setStartTagHandlers ($handlers)
1394 Set the start_tag handlers. $handlers is a reference to a hash
1395 similar to the one in the "start_tag_handlers" option of new. All
1396 previous handlers are unset. The method returns the reference to
1397 the previous handlers.
1398
setStartTagHandler ($exp, $handler)
Set a single start_tag handler for elements matching $exp.
$handler is a reference to a subroutine. If the handler was
previously set then the reference to the previous handler is
returned.
1404
1405 setEndTagHandlers ($handlers)
1406 Set the end_tag handlers. $handlers is a reference to a hash
1407 similar to the one in the "end_tag_handlers" option of new. All
1408 previous handlers are unset. The method returns the reference to
1409 the previous handlers.
1410
setEndTagHandler ($exp, $handler)
Set a single end_tag handler for elements matching $exp. $handler
is a reference to a subroutine. If the handler was previously set
then the reference to the previous handler is returned.
1415
1416 setTwigRoots ($handlers)
1417 Same as using the "twig_roots" option when creating the twig
1418
setCharHandler ($exp, $handler)
1420 Set a "char_handler"
1421
1422 setIgnoreEltsHandler ($exp)
Set an "ignore_elt" handler (elements that match $exp will be
ignored).
1425
1426 setIgnoreEltsHandlers ($exp)
1427 Set all "ignore_elt" handlers (previous handlers are replaced)
1428
1429 dtd Return the dtd (an XML::Twig::DTD object) of a twig
1430
1431 xmldecl
1432 Return the XML declaration for the document, or a default one if it
1433 doesn't have one
1434
1435 doctype
1436 Return the doctype for the document
1437
1438 doctype_name
Returns the doctype name of the document, from the doctype declaration
1440
1441 system_id
1442 returns the system value of the DTD of the document from the
1443 doctype declaration
1444
1445 public_id
1446 returns the public doctype of the document from the doctype
1447 declaration
1448
1449 internal_subset
1450 returns the internal subset of the DTD
1451
1452 dtd_text
1453 Return the DTD text
1454
1455 dtd_print
1456 Print the DTD
1457
1458 model ($tag)
1459 Return the model (in the DTD) for the element $tag
1460
1461 root
1462 Return the root element of a twig
1463
1464 set_root ($elt)
1465 Set the root of a twig
1466
1467 first_elt ($optional_condition)
1468 Return the first element matching $optional_condition of a twig, if
1469 no condition is given then the root is returned
1470
1471 last_elt ($optional_condition)
1472 Return the last element matching $optional_condition of a twig, if
1473 no condition is given then the last element of the twig is returned
1474
1475 elt_id ($id)
1476 Return the element whose "id" attribute is $id
1477
1478 getEltById
1479 Same as "elt_id"
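
The navigation methods above can be tried on a small document. This
is a sketch with invented content, relying on the default "id"
attribute name for "elt_id".

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse(
    '<doc><sec id="s1"><p>one</p></sec><sec id="s2"><p>two</p></sec></doc>');

my $root_tag = $twig->first_elt->tag;           # no condition: the root
my $first_p  = $twig->first_elt('p')->text;
my $last_p   = $twig->last_elt('p')->text;
my $by_id    = $twig->elt_id('s2')->first_child('p')->text;

print "root: $root_tag, first p: $first_p, last p: $last_p, in s2: $by_id\n";
```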
1480
1481 index ($index_name, $optional_index)
If the $optional_index argument is present, return the
corresponding element in the index (created using the "index"
option for "XML::Twig->new").
1485
1486 If the argument is not present, return an arrayref to the index
1487
1488 normalize
Merge together all consecutive pcdata elements in the document (if
for example you have turned some elements into pcdata using
"erase", this will give you a "clean" document in which all
text elements are as long as possible).
1493
1494 encoding
1495 This method returns the encoding of the XML document, as defined by
1496 the "encoding" attribute in the XML declaration (ie it is "undef"
1497 if the attribute is not defined)
1498
1499 set_encoding
1500 This method sets the value of the "encoding" attribute in the XML
1501 declaration. Note that if the document did not have a declaration
1502 it is generated (with an XML version of 1.0)
1503
1504 xml_version
1505 This method returns the XML version, as defined by the "version"
1506 attribute in the XML declaration (ie it is "undef" if the attribute
1507 is not defined)
1508
1509 set_xml_version
1510 This method sets the value of the "version" attribute in the XML
1511 declaration. If the declaration did not exist it is created.
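
The following sketch shows these methods on a document that has no
XML declaration to start with, so that the generated declaration can
be observed. The one-element document is just a placeholder.

```perl
use strict;
use warnings;
use XML::Twig;

# No XML declaration in the input
my $twig = XML::Twig->new->parse('<doc/>');

my $before = $twig->encoding;      # undef, as there is no declaration
$twig->set_encoding('UTF-8');      # generates the declaration

my $encoding = $twig->encoding;
my $version  = $twig->xml_version;
my $decl     = $twig->xmldecl;

print defined $before ? $before : 'undef', "\n";
print "$encoding, XML $version\n";
print $decl;
```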
1512
1513 standalone
1514 This method returns the value of the "standalone" declaration for
1515 the document
1516
1517 set_standalone
1518 This method sets the value of the "standalone" attribute in the XML
1519 declaration. Note that if the document did not have a declaration
1520 it is generated (with an XML version of 1.0)
1521
1522 set_output_encoding
Set the "encoding" attribute in the XML declaration
1524
1525 set_doctype ($name, $system, $public, $internal)
Set the doctype of the document. If an argument is "undef" (or not
present) then its former value is retained, if a false ('' or 0)
value is passed then the former value is deleted.
1529
1530 entity_list
1531 Return the entity list of a twig
1532
1533 entity_names
1534 Return the list of all defined entities
1535
1536 entity ($entity_name)
1537 Return the entity
1538
1539 notation_list
1540 Return the notation list of a twig
1541
1542 notation_names
1543 Return the list of all defined notations
1544
1545 notation ($notation_name)
1546 Return the notation
1547
1548 change_gi ($old_gi, $new_gi)
Performs a (very fast) global change. All elements $old_gi are now
$new_gi. This is a bit dangerous though and should be avoided if
possible, as the new tag might be ignored in subsequent processing.
1552
1553 See "BUGS "
1554
1555 flush ($optional_filehandle, %options)
Flushes a twig up to (and including) the current element, then
deletes all unnecessary elements from the tree that's kept in
memory. "flush" keeps track of which elements need to be
open/closed, so if you flush from handlers you don't have to worry
about anything. Just keep flushing the twig every time you're done
with a sub-tree and it will come out well-formed. After the whole
parsing don't forget to "flush" one more time to print the end of
the document. The doctype and entity declarations are also
printed.

"flush" takes an optional filehandle as an argument.
1567
1568 If you use "flush" at any point during parsing, the document will
1569 be flushed one last time at the end of the parsing, to the proper
1570 filehandle.
1571
1572 options: use the "update_DTD" option if you have updated the
1573 (internal) DTD and/or the entity list and you want the updated DTD
1574 to be output
1575
1576 The "pretty_print" option sets the pretty printing of the document.
1577
Example:

$t->flush( update_DTD => 1);
$t->flush( $filehandle, pretty_print => 'indented');
$t->flush( \*FILE);
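
A complete, if tiny, run of the flush-from-a-handler pattern might
look like the sketch below. It makes two assumptions: that "flush"
called without a filehandle prints to the currently selected handle,
as Perl's "print" does, and that the final flush at the end of the
parse happens automatically as described above. The document and the
"seen" attribute are invented, and the output is captured in a string
only so it can be inspected.

```perl
use strict;
use warnings;
use XML::Twig;

# Capture the output in memory; a real script would print to a file or STDOUT
my $output = '';
open my $out, '>', \$output or die "cannot open in-memory handle: $!";
my $previous = select $out;    # assumption: flush prints to the selected handle

my $twig = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ($t, $item) = @_;
            $item->set_att( seen => 1 );    # process the element...
            $t->flush;                      # ...then output it and free the memory
        },
    },
);
$twig->parse('<list><item>a</item><item>b</item></list>');

select $previous;
close $out;

print "$output\n";
```

At most one "item" is held in memory at any time, which is the point
of flushing from the handler.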
1581
1582 flush_up_to ($elt, $optional_filehandle, %options)
1583 Flushes up to the $elt element. This allows you to keep part of the
1584 tree in memory when you "flush".
1585
1586 options: see flush.
1587
1588 purge
1589 Does the same as a "flush" except it does not print the twig. It
1590 just deletes all elements that have been completely parsed so far.
1591
1592 purge_up_to ($elt)
1593 Purges up to the $elt element. This allows you to keep part of the
1594 tree in memory when you "purge".
1595
1596 print ($optional_filehandle, %options)
1597 Prints the whole document associated with the twig. To be used only
1598 AFTER the parse.
1599
1600 options: see "flush".
1601
1602 print_to_file ($filename, %options)
1603 Prints the whole document associated with the twig to file
1604 $filename. To be used only AFTER the parse.
1605
1606 options: see "flush".
1607
1608 safe_print_to_file ($filename, %options)
Prints the whole document associated with the twig to file
$filename. This variant, which probably only works on *nix, prints
to a temp file, then moves the temp file to overwrite the original
file.

This is a bit safer when 2 processes can potentially write the same
file: only the last one will succeed, but the file won't be
corrupted. I often use this for cron jobs, so testing the code
doesn't interfere with the cron job running at the same time.
1618
1619 options: see "flush".
1620
1621 sprint
1622 Return the text of the whole document associated with the twig. To
1623 be used only AFTER the parse.
1624
1625 options: see "flush".
1626
1627 trim
1628 Trim the document: gets rid of initial and trailing spaces, and
1629 replaces multiple spaces by a single one.
1630
1631 toSAX1 ($handler)
1632 Send SAX events for the twig to the SAX1 handler $handler
1633
1634 toSAX2 ($handler)
1635 Send SAX events for the twig to the SAX2 handler $handler
1636
1637 flush_toSAX1 ($handler)
1638 Same as flush, except that SAX events are sent to the SAX1 handler
1639 $handler instead of the twig being printed
1640
1641 flush_toSAX2 ($handler)
1642 Same as flush, except that SAX events are sent to the SAX2 handler
1643 $handler instead of the twig being printed
1644
1645 ignore
1646 This method should be called during parsing, usually in
1647 "start_tag_handlers". It causes the element to be skipped during
1648 the parsing: the twig is not built for this element, it will not be
1649 accessible during parsing or after it. The element will not take up
1650 any memory and parsing will be faster.
1651
1652 Note that this method can also be called on an element. If the
1653 element is a parent of the current element then this element will
1654 be ignored (the twig will not be built any more for it and what has
1655 already been built will be deleted).
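
As a sketch of the usual pattern, "ignore" can be called from a start tag handler so the element is never built. The tag names ("secret", "public") are made up for this example:

```perl
use strict;
use warnings;
use XML::Twig;

# Skip every 'secret' element as soon as its start tag is seen.
my $twig = XML::Twig->new(
    start_tag_handlers => {
        secret => sub { my ($t) = @_; $t->ignore; },
    },
);
$twig->parse(
    '<doc><public>a</public><secret><x>b</x></secret><public>c</public></doc>'
);

# Only the elements that were not ignored are left in the tree.
my @tags = map { $_->tag } $twig->root->children;
```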
1656
1657 set_pretty_print ($style)
1658 Set the pretty print method, amongst '"none"' (default),
1659 '"nsgmls"', '"nice"', '"indented"', '"indented_c"', '"wrapped"',
1660 '"record"' and '"record_c"'
1661
1662 WARNING: the pretty print style is a GLOBAL variable, so once set
1663 it's applied to ALL "print"'s (and "sprint"'s). Same goes if you
1664 use XML::Twig with "mod_perl" . This should not be a problem as the
1665 XML that's generated is valid anyway, and XML processors (as well
1666 as HTML processors, including browsers) should not care. Let me
1667 know if this is a big problem, but at the moment the
1668 performance/cleanliness trade-off clearly favors the global
1669 approach.
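
A short sketch of switching styles and restoring the default afterwards, which matters because the setting is global. The document is an illustrative assumption:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p>a</p><p>b</p></doc>');

# Default style ('none'): output stays on one line.
my $compact = $twig->sprint;

# Switch to an indented layout for the next print/sprint calls.
$twig->set_pretty_print('indented');
my $indented = $twig->sprint;

# Restore the default so later output is not affected.
$twig->set_pretty_print('none');
```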
1670
1671 set_empty_tag_style ($style)
1672 Set the empty tag display style ('"normal"', '"html"' or
1673 '"expand"'). As with "set_pretty_print" this sets a global flag.
1674
1675 "normal" outputs an empty tag '"<tag/>"', "html" adds a space
1676 '"<tag />"' for elements that can be empty in XHTML and "expand"
1677 outputs '"<tag></tag>"'
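
The three styles can be compared side by side with a sketch like the following. The "br" element is chosen here because it is one of the elements that can be empty in XHTML:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><br/></doc>');

# Serialize the same tree once per style.
my %out;
for my $style (qw(normal html expand)) {
    $twig->set_empty_tag_style($style);
    $out{$style} = $twig->root->sprint;
}

# The flag is global, so put the default back.
$twig->set_empty_tag_style('normal');
```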
1678
1679 set_remove_cdata ($flag)
1680 set (or unset) the flag that forces the twig to output CDATA
1681 sections as regular (escaped) PCDATA
1682
1683 print_prolog ($optional_filehandle, %options)
1684 Prints the prolog (XML declaration + DTD + entity declarations) of
1685 a document.
1686
1687 options: see "flush".
1688
1689 prolog ($optional_filehandle, %options)
1690 Return the prolog (XML declaration + DTD + entity declarations) of
1691 a document.
1692
1693 options: see "flush".
1694
1695 finish
1696 Call Expat "finish" method. Unsets all handlers (including
1697 internal ones that set context), but expat continues parsing to the
1698 end of the document or until it finds an error. It should finish
1699 up a lot faster than with the handlers set.
1700
1701 finish_print
1702 Stops twig processing, flushes the twig and proceeds to finish
1703 printing the document as fast as possible. Use this method when
1704 modifying a document and the modification is done.
1705
1706 finish_now
1707 Stops twig processing, does not finish parsing the document (which
1708 could actually be not well-formed after the point where
1709 "finish_now" is called). Execution resumes after the "parse" or
1710 "parsefile" call. The content of the twig is what has been parsed
1711 so far (all open elements at the time "finish_now" is called are
1712 considered closed).
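
A sketch of stopping early once enough data has been seen. The "item" elements and the cut-off of two items are assumptions made for the example:

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;
my $twig = XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ( $t, $item ) = @_;
            push @seen, $item->text;
            # Stop parsing as soon as two items have been collected.
            $t->finish_now if @seen == 2;
        },
    },
);
$twig->parse('<list><item>a</item><item>b</item><item>c</item></list>');

# Execution resumes here; the third item was never processed.
my $resumed = 1;
```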
1713
1714 set_expand_external_entities
1715 Same as using the "expand_external_ents" option when creating the
1716 twig
1717
1718 set_input_filter
1719 Same as using the "input_filter" option when creating the twig
1720
1721 set_keep_atts_order
1722 Same as using the "keep_atts_order" option when creating the twig
1723
1724 set_keep_encoding
1725 Same as using the "keep_encoding" option when creating the twig
1726
1727 escape_gt
1728 Usually XML::Twig does not escape > in its output. Using this
1729 option makes it replace > by &gt;
1730
1731 do_not_escape_gt
1732 reverts XML::Twig behavior to its default of not escaping > in its
1733 output.
1734
1735 set_output_filter
1736 Same as using the "output_filter" option when creating the twig
1737
1738 set_output_text_filter
1739 Same as using the "output_text_filter" option when creating the
1740 twig
1741
1742 add_stylesheet ($type, @options)
1743 Adds an external stylesheet to an XML document.
1744
1745 Supported types and options:
1746
1747 xsl option: the url of the stylesheet
1748
1749 Example:
1750
1751 $t->add_stylesheet( xsl => "xsl_style.xsl");
1752
1753 will generate the following PI at the beginning of the
1754 document:
1755
1756 <?xml-stylesheet type="text/xsl" href="xsl_style.xsl"?>
1757
1758 css option: the url of the stylesheet
1759
1760 active_twig
1761 a class method that returns the last processed twig, so you
1762 don't necessarily need the object to call methods on it.
1763
1764 Methods inherited from XML::Parser::Expat
1765 A twig inherits all the relevant methods from XML::Parser::Expat.
1766 These methods can only be used during the parsing phase (they will
1767 generate a fatal error otherwise).
1768
1769 Inherited methods are:
1770
1771 depth
1772 Returns the size of the context list.
1773
1774 in_element
1775 Returns true if NAME is equal to the name of the innermost
1776 currently opened element. If namespace processing is being used
1777 and you want to check against a name that may be in a
1778 namespace, then use the generate_ns_name method to create the
1779 NAME argument.
1780
1781 within_element
1782 Returns the number of times the given name appears in the
1783 context list. If namespace processing is being used and you
1784 want to check against a name that may be in a namespace, then
1785 use the generate_ns_name method to create the NAME argument.
1786
1787 context
1788 Returns a list of element names that represent open elements,
1789 with the last one being the innermost. Inside start and end tag
1790 handlers, this will be the tag of the parent element.
1791
1792 current_line
1793 Returns the line number of the current position of the parse.
1794
1795 current_column
1796 Returns the column number of the current position of the parse.
1797
1798 current_byte
1799 Returns the current position of the parse.
1800
1801 position_in_context
1802 Returns a string that shows the current parse position. LINES
1803 should be an integer >= 0 that represents the number of lines
1804 on either side of the current parse line to place into the
1805 returned string.
1806
1807 base ([NEWBASE])
1808 Returns the current value of the base for resolving relative
1809 URIs. If NEWBASE is supplied, changes the base to that value.
1810
1811 current_element
1812 Returns the name of the innermost currently opened element.
1813 Inside start or end handlers, returns the parent of the element
1814 associated with those tags.
1815
1816 element_index
1817 Returns an integer that is the depth-first visit order of the
1818 current element. This will be zero outside of the root
1819 element. For example, this will return 1 when called from the
1820 start handler for the root element start tag.
1821
1822 recognized_string
1823 Returns the string from the document that was recognized in
1824 order to call the current handler. For instance, when called
1825 from a start handler, it will give us the start-tag string. The
1826 string is encoded in UTF-8. This method doesn't return a
1827 meaningful string inside declaration handlers.
1828
1829 original_string
1830 Returns the verbatim string from the document that was
1831 recognized in order to call the current handler. The string is
1832 in the original document encoding. This method doesn't return a
1833 meaningful string inside declaration handlers.
1834
1835 xpcroak
1836 Concatenate onto the given message the current line number
1837 within the XML document plus the message implied by
1838 ErrorContext. Then croak with the formed message.
1839
1840 xpcarp
1841 Concatenate onto the given message the current line number
1842 within the XML document plus the message implied by
1843 ErrorContext. Then carp with the formed message.
1844
1845 xml_escape(TEXT [, CHAR [, CHAR ...]])
1846 Returns TEXT with markup characters turned into character
1847 entities. Any additional characters provided as arguments are
1848 also turned into character references where found in TEXT.
1849
1850 (this method is broken on some versions of expat/XML::Parser)
1851
1852 path ( $optional_tag)
1853 Return the element context in a form similar to XPath's short form:
1854 '"/root/tag1/../tag"'
1855
1856 get_xpath ( $optional_array_ref, $xpath, $optional_offset)
1857 Performs a "get_xpath" on the document root (see "XML::Twig::Elt")
1858
1859 If the $optional_array_ref argument is used the array must contain
1860 elements. The $xpath expression is applied to each element in turn
1861 and the result is the union of all results. This way a first query can
1862 be refined in further steps.
1863
1864 find_nodes ( $optional_array_ref, $xpath, $optional_offset)
1865 same as "get_xpath"
1866
1867 findnodes ( $optional_array_ref, $xpath, $optional_offset)
1868 same as "get_xpath" (similar to the XML::LibXML method)
1869
1870 findvalue ( $optional_array_ref, $xpath, $optional_offset)
1871 Return the "join" of all texts of the results of applying
1872 "get_xpath" to the node (similar to the XML::LibXML method)
1873
1874 findvalues ( $optional_array_ref, $xpath, $optional_offset)
1875 Return an array of all texts of the results of applying "get_xpath"
1876 to the node
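
A sketch of the query methods above on a small in-memory document. The document and the paths are illustrative assumptions:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<doc><section><title>One</title></section>'
  . '<section><title>Two</title></section></doc>'
);

# All matching elements, in document order.
my @titles = $twig->findnodes('/doc/section/title');

# The texts of all matches joined into one string.
my $joined = $twig->findvalue('/doc/section/title');

# Refine a first query: apply a second expression to each earlier result.
my @sections = $twig->findnodes('/doc/section');
my @refined  = $twig->get_xpath( \@sections, 'title' );
```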
1877
1878 subs_text ($regexp, $replace)
1879 subs_text does text substitution on the whole document, similar to
1880 perl's " s///" operator.
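
For instance, a document-wide spelling change might look like this (the text being replaced is an assumption for the example):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p>colour one</p><p>colour two</p></doc>');

# Replace every occurrence in the text of the whole document.
$twig->subs_text( qr/colour/, 'color' );

my $text = $twig->root->text;
```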
1881
1882 dispose
1883 Useful only if you don't have "Scalar::Util" or "WeakRef"
1884 installed.
1885
1886 Properly reclaims the memory used by an XML::Twig object. As the
1887 object has circular references it never goes out of scope, so if
1888 you want to parse lots of XML documents then the memory leak
1889 becomes a problem. Use "$twig->dispose" to clear this problem.
1890
1891 att_accessors (list_of_attribute_names)
1892 A convenience method that creates l-valued accessors for
1893 attributes. So "$twig->create_accessors( 'foo')" will create a
1894 "foo" method that can be called on elements:
1895
1896 $elt->foo; # equivalent to $elt->att( 'foo');
1897 $elt->foo( 'bar'); # equivalent to $elt->set_att( foo => 'bar');
1898
1899 The methods are l-valued only under those perl's that support this
1900 feature (5.6 and above)
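
A sketch of an attribute accessor in use. The attribute name "status" is an assumption, chosen because it does not clash with an existing element method:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

# Create a 'status' method on elements.
$twig->att_accessors('status');

$twig->parse('<tasks><task status="open">write docs</task></tasks>');
my $task = $twig->root->first_child('task');

my $before = $task->status;       # read the attribute
$task->status('done');            # set the attribute
my $after  = $task->att('status');
```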
1901
1902 create_accessors (list_of_attribute_names)
1903 Same as att_accessors
1904
1905 elt_accessors (list_of_attribute_names)
1906 A convenience method that creates accessors for elements. So
1907 "$twig->elt_accessors( 'foo')" will create a "foo" method that
1908 can be called on elements:
1909
1910 $elt->foo; # equivalent to $elt->first_child( 'foo');
1911
1912 field_accessors (list_of_attribute_names)
1913 A convenience method that creates accessors for element values
1914 ("field"). So "$twig->field_accessors( 'foo')" will create a
1915 "foo" method that can be called on elements:
1916
1917 $elt->foo; # equivalent to $elt->field( 'foo');
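
A sketch combining the element and field accessors. The names "author" and "price" are assumptions picked so they do not collide with existing element methods:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->elt_accessors('author');     # $elt->author returns a child element
$twig->field_accessors('price');    # $elt->price returns a child's text

$twig->parse('<book><author>A. Writer</author><price>12</price></book>');
my $book = $twig->root;

my $author_elt = $book->author;     # scalar context: first 'author' child
my $price      = $book->price;      # text of the first 'price' child
```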
1918
1919 set_do_not_escape_amp_in_atts
1920 An evil method, that I only document because Test::Pod::Coverage
1921 complains otherwise, but really, you don't want to know about it.
1922
1923 XML::Twig::Elt
1924 new ($optional_tag, $optional_atts, @optional_content)
1925 The "tag" is optional (but then you can't have any content), the
1926 $optional_atts argument is a reference to a hash of attributes, the
1927 content can be just a string or a list of strings and elements. A
1928 content of '"#EMPTY"' creates an empty element.
1929
1930 Examples: my $elt= XML::Twig::Elt->new();
1931 my $elt= XML::Twig::Elt->new( para => { align => 'center' });
1932 my $elt= XML::Twig::Elt->new( para => { align => 'center' }, 'foo');
1933 my $elt= XML::Twig::Elt->new( br => '#EMPTY');
1934 my $elt= XML::Twig::Elt->new( 'para');
1935 my $elt= XML::Twig::Elt->new( para => 'this is a para');
1936 my $elt= XML::Twig::Elt->new( para => $elt3, 'another para');
1937
1938 The strings are not parsed, the element is not attached to any
1939 twig.
1940
1941 WARNING: if you rely on ID's then you will have to set the id
1942 yourself. At this point the element does not belong to a twig yet,
1943 so the ID attribute is not known so it won't be stored in the ID
1944 list.
1945
1946 Note that "#COMMENT", "#PCDATA" or "#CDATA" are valid tag names,
1947 that will create text elements.
1948
1949 To create an element "foo" containing a CDATA section:
1950
1951 my $foo= XML::Twig::Elt->new( '#CDATA' => "content of the CDATA section")
1952 ->wrap_in( 'foo');
1953
1954 An attribute of '#CDATA', will create the content of the element as
1955 CDATA:
1956
1957 my $elt= XML::Twig::Elt->new( 'p' => { '#CDATA' => 1}, 'foo < bar');
1958
1959 creates an element
1960
1961 <p><![CDATA[foo < bar]]></p>
1962
1963 parse ($string, %args)
1964 Creates an element from an XML string. The string is actually
1965 parsed as a new twig, then the root of that twig is returned. The
1966 arguments in %args are passed to the twig. As always if the parse
1967 fails the parser will die, so use an eval if you want to trap
1968 syntax errors.
1969
1970 As obviously the element does not exist beforehand this method has
1971 to be called on the class:
1972
1973 my $elt= parse XML::Twig::Elt( "<a> string to parse, with <sub/>
1974 <elements>, actually tons of </elements>
1975 h</a>");
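
Since a failed parse dies, a sketch of trapping the error with eval could look like this (both input strings are illustrative):

```perl
use strict;
use warnings;
use XML::Twig;

# Well-formed input: an element is returned.
my $ok = eval { XML::Twig::Elt->parse('<a>text <b>bold</b></a>') };

# Not well-formed (mismatched tags): the parser dies, eval returns undef.
my $bad   = eval { XML::Twig::Elt->parse('<a><b></a>') };
my $error = $@;    # holds the parser's error message
```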
1976
1977 set_inner_xml ($string)
1978 Sets the content of the element to be the tree created from the
1979 string
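
A minimal sketch, assuming an illustrative "p" element whose content is replaced:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p>old</p></doc>');
my $p = $twig->root->first_child('p');

# The string is parsed as XML, so the <b> becomes a real child element.
$p->set_inner_xml('new <b>bold</b> content');

my $xml = $p->sprint;
```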
1980
1981 set_inner_html ($string)
1982 Sets the content of the element, after parsing the string with an
1983 HTML parser (HTML::Parser)
1984
1985 set_outer_xml ($string)
1986 Replaces the element with the tree created from the string
1987
1988 print ($optional_filehandle, $optional_pretty_print_style)
1989 Prints an entire element, including the tags, optionally to a
1990 $optional_filehandle, optionally with a $pretty_print_style.
1991
1992 The print outputs XML data so base entities are escaped.
1993
1994 print_to_file ($filename, %options)
1995 Prints the element to file $filename.
1996
1997 options: see "flush".

1998 sprint ($elt, $optional_no_enclosing_tag)
1999
2000 Return the xml string for an entire element, including the tags.
2001 If the optional second argument is true then only the string inside
2002 the element is returned (the start and end tag for $elt are not).
2003 The text is XML-escaped: base entities (& and < in text, & < and "
2004 in attribute values) are turned into entities.
2005
2006 gi Return the gi of the element (the gi is the "generic identifier"
2007 the tag name in SGML parlance).
2008
2009 "tag" and "name" are synonyms of "gi".
2010
2011 tag Same as "gi"
2012
2013 name
2014 Same as "tag"
2015
2016 set_gi ($tag)
2017 Set the gi (tag) of an element
2018
2019 set_tag ($tag)
2020 Set the tag (="tag") of an element
2021
2022 set_name ($name)
2023 Set the name (="tag") of an element
2024
2025 root
2026 Return the root of the twig in which the element is contained.
2027
2028 twig
2029 Return the twig containing the element.
2030
2031 parent ($optional_condition)
2032 Return the parent of the element, or the first ancestor matching
2033 the $optional_condition
2034
2035 first_child ($optional_condition)
2036 Return the first child of the element, or the first child matching
2037 the $optional_condition
2038
2039 has_child ($optional_condition)
2040 Return the first child of the element, or the first child matching
2041 the $optional_condition (same as first_child)
2042
2043 has_children ($optional_condition)
2044 Return the first child of the element, or the first child matching
2045 the $optional_condition (same as first_child)
2046
2047 first_child_text ($optional_condition)
2048 Return the text of the first child of the element, or the first
2049 child
2050 matching the $optional_condition. If there is no first_child then
2051 it returns ''. This avoids getting the child, checking for its
2052 existence then getting the text for trivial cases.
2053
2054 Similar methods are available for the other navigation methods:
2055
2056 last_child_text
2057 prev_sibling_text
2058 next_sibling_text
2059 prev_elt_text
2060 next_elt_text
2061 child_text
2062 parent_text
2063
2064 All these methods also exist in a "trimmed" variant:
2065
2066 first_child_trimmed_text
2067 last_child_trimmed_text
2068 prev_sibling_trimmed_text
2069 next_sibling_trimmed_text
2070 prev_elt_trimmed_text
2071 next_elt_trimmed_text
2072 child_trimmed_text
2073 parent_trimmed_text
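
A sketch of the plain and trimmed variants, including the empty string returned for a missing child. The "book" document is an assumption for the example:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<book><title>  XML   in Practice </title></book>');
my $book = $twig->root;

my $title   = $book->first_child_text('title');          # text as parsed
my $trimmed = $book->first_child_trimmed_text('title');  # spaces cleaned up
my $missing = $book->first_child_text('subtitle');       # no such child
```
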
2074 field ($condition)
2075 Same method as "first_child_text" with a different name
2076
2077 fields ($condition_list)
2078 Return the list of fields (text of first child matching the
2079 conditions), missing fields are returned as the empty string.
2080
2081 Same method as "first_child_text" with a different name
2082
2083 trimmed_field ($optional_condition)
2084 Same method as "first_child_trimmed_text" with a different name
2085
2086 set_field ($condition, $optional_atts, @list_of_elt_and_strings)
2087 Set the content of the first child of the element that matches
2088 $condition, the rest of the arguments is the same as for
2089 "set_content"
2090
2091 If no child matches $condition _and_ if $condition is a valid XML
2092 element name, then a new element by that name is created and
2093 inserted as the last child.
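
A sketch showing both cases: updating an existing field and creating a missing one. The element names are illustrative:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<item><name>pen</name></item>');
my $item = $twig->root;

# 'name' exists: its content is replaced.
$item->set_field( name => 'pencil' );

# 'price' does not exist: it is created as the last child.
$item->set_field( price => '2' );

my $xml = $item->sprint;
```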
2094
2095 first_child_matches ($optional_condition)
2096 Return the element if the first child of the element (if it exists)
2097 passes the $optional_condition, "undef" otherwise
2098
2099 if( $elt->first_child_matches( 'title')) ...
2100
2101 is equivalent to
2102
2103 if( $elt->first_child && $elt->first_child->passes( 'title'))
2104
2105 "first_child_is" is another name for this method
2106
2107 Similar methods are available for the other navigation methods:
2108
2109 last_child_matches
2110 prev_sibling_matches
2111 next_sibling_matches
2112 prev_elt_matches
2113 next_elt_matches
2114 child_matches
2115 parent_matches
2116 is_first_child ($optional_condition)
2117 returns true (the element) if the element is the first child of its
2118 parent (optionally that satisfies the $optional_condition)
2119
2120 is_last_child ($optional_condition)
2121 returns true (the element) if the element is the last child of its
2122 parent (optionally that satisfies the $optional_condition)
2123
2124 prev_sibling ($optional_condition)
2125 Return the previous sibling of the element, or the previous sibling
2126 matching $optional_condition
2127
2128 next_sibling ($optional_condition)
2129 Return the next sibling of the element, or the first one matching
2130 $optional_condition.
2131
2132 next_elt ($optional_elt, $optional_condition)
2133 Return the next elt (optionally matching $optional_condition) of
2134 the element. This is defined as the next element which opens after
2135 the current element opens. Which usually means the first child of
2136 the element. Counter-intuitive as it might look this allows you to
2137 loop through the whole document by starting from the root.
2138
2139 The $optional_elt is the root of a subtree. When the "next_elt" is
2140 out of the subtree then the method returns undef. You can then walk
2141 a sub-tree with:
2142
2143 my $elt= $subtree_root;
2144 while( $elt= $elt->next_elt( $subtree_root))
2145 { # insert processing code here
2146 }
2147
2148 prev_elt ($optional_condition)
2149 Return the previous elt (optionally matching $optional_condition)
2150 of the element. This is the first element which opens before the
2151 current one. It is usually either the last descendant of the
2152 previous sibling or simply the parent
2153
2154 next_n_elt ($offset, $optional_condition)
2155 Return the $offset-th element that matches the $optional_condition
2156
2157 following_elt
2158 Return the following element (as per the XPath following axis)
2159
2160 preceding_elt
2161 Return the preceding element (as per the XPath preceding axis)
2162
2163 following_elts
2164 Return the list of following elements (as per the XPath following
2165 axis)
2166
2167 preceding_elts
2168 Return the list of preceding elements (as per the XPath preceding
2169 axis)
2170
2171 children ($optional_condition)
2172 Return the list of children (optionally which matches
2173 $optional_condition) of the element. The list is in document order.
2174
2175 children_count ($optional_condition)
2176 Return the number of children of the element (optionally which
2177 matches $optional_condition)
2178
2179 children_text ($optional_condition)
2180 In array context, returns an array containing the text of children
2181 of the element (optionally which matches $optional_condition)
2182
2183 In scalar context, returns the concatenation of the text of
2184 children of the element
2185
2186 children_trimmed_text ($optional_condition)
2187 In array context, returns an array containing the trimmed text of
2188 children of the element (optionally which matches
2189 $optional_condition)
2190
2191 In scalar context, returns the concatenation of the trimmed text of
2192 children of the element
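
The difference between array and scalar context can be sketched as follows (the list markup is an assumption for the example):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<ul><li> a </li><li>b</li><p>c</p></ul>');
my $ul = $twig->root;

my @texts   = $ul->children_text('li');          # one entry per child
my $joined  = $ul->children_text('li');          # scalar: concatenation
my @trimmed = $ul->children_trimmed_text('li');  # same, spaces cleaned up
my $count   = $ul->children_count('li');         # the 'p' is not counted
```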
2193
2194 children_copy ($optional_condition)
2195 Return a list of elements that are copies of the children of the
2196 element, optionally which matches $optional_condition
2197
2198 descendants ($optional_condition)
2199 Return the list of all descendants (optionally which matches
2200 $optional_condition) of the element. This is the equivalent of the
2201 "getElementsByTagName" of the DOM (by the way, if you are really a
2202 DOM addict, you can use "getElementsByTagName" instead)
2203
2204 getElementsByTagName ($optional_condition)
2205 Same as "descendants"
2206
2207 find_by_tag_name ($optional_condition)
2208 Same as "descendants"
2209
2210 descendants_or_self ($optional_condition)
2211 Same as "descendants" except that the element itself is included in
2212 the list if it matches the $optional_condition
2213
2214 first_descendant ($optional_condition)
2215 Return the first descendant of the element that matches the
2216 condition
2217
2218 last_descendant ($optional_condition)
2219 Return the last descendant of the element that matches the
2220 condition
2221
2222 ancestors ($optional_condition)
2223 Return the list of ancestors (optionally matching
2224 $optional_condition) of the element. The list is ordered from the
2225 innermost ancestor to the outermost one
2226
2227 NOTE: the element itself is not part of the list, in order to
2228 include it you will have to use ancestors_or_self
2229
2230 ancestors_or_self ($optional_condition)
2231 Return the list of ancestors (optionally matching
2232 $optional_condition) of the element, including the element (if it
2233 matches the condition). The list is ordered from the innermost
2234 ancestor to the outermost one
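
A sketch of the ordering, innermost first, on an illustrative nested document:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><section><p><b>x</b></p></section></doc>');
my $bold = $twig->root->first_descendant('b');

# The element itself is excluded from ancestors...
my @anc      = map { $_->tag } $bold->ancestors;
# ...and included by ancestors_or_self.
my @anc_self = map { $_->tag } $bold->ancestors_or_self;
# A condition filters the list.
my @sections = map { $_->tag } $bold->ancestors('section');
```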
2235
2236 passes ($condition)
2237 Return the element if it passes the $condition
2238
2239 att ($att)
2240 Return the value of attribute $att or "undef"
2241
2242 latt ($att)
2243 Return the value of attribute $att or "undef"
2244
2245 this method is an lvalue, so you can do "$elt->latt( 'foo')= 'bar'"
2246 or "$elt->latt( 'foo')++;"
2247
2248 set_att ($att, $att_value)
2249 Set the attribute of the element to the given value
2250
2251 You can actually set several attributes this way:
2252
2253 $elt->set_att( att1 => "val1", att2 => "val2");
2254
2255 del_att ($att)
2256 Delete the attribute for the element
2257
2258 You can actually delete several attributes at once:
2259
2260 $elt->del_att( 'att1', 'att2', 'att3');
2261
2262 att_exists ($att)
2263 Returns true if the attribute $att exists for the element, false
2264 otherwise
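
The attribute methods above can be combined as in this sketch (attribute names and values are assumptions for the example):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p class="intro" count="1">x</p></doc>');
my $p = $twig->root->first_child('p');

my $class = $p->att('class');                # read an attribute
$p->latt('count')++;                         # lvalue access: count is now 2
$p->set_att( lang => 'en', title => 't' );   # set several at once
$p->del_att('class');                        # remove one

my $has_class = $p->att_exists('class') ? 1 : 0;
```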
2265
2266 cut Cut the element from the tree. The element still exists, it can be
2267 copied or pasted somewhere else, it is just not attached to the
2268 tree anymore.
2269
2270 Note that the "old" links to the parent, previous and next siblings
2271 can still be accessed using the former_* methods
2272
2273 former_next_sibling
2274 Returns the former next sibling of a cut node (or undef if the node
2275 has not been cut)
2276
2277 This makes it easier to write loops where you cut elements:
2278
2279 my $child= $parent->first_child( 'achild');
2280 while( $child && $child->att( 'cut'))
2281 { $child->cut; $child= $child->former_next_sibling; }
2282
2283 former_prev_sibling
2284 Returns the former previous sibling of a cut node (or undef if the
2285 node has not been cut)
2286
2287 former_parent
2288 Returns the former parent of a cut node (or undef if the node has
2289 not been cut)
2290
2291 cut_children ($optional_condition)
2292 Cut all the children of the element (or all of those which satisfy
2293 the $optional_condition).
2294
2295 Return the list of children
2296
2297 cut_descendants ($optional_condition)
2298 Cut all the descendants of the element (or all of those which
2299 satisfy the $optional_condition).
2300
2301 Return the list of descendants
2302
2303 copy ($elt)
2304 Return a copy of the element. The copy is a "deep" copy: all sub-
2305 elements of the element are duplicated.
2306
2307 paste ($optional_position, $ref)
2308 Paste a (previously "cut" or newly generated) element. Die if the
2309 element already belongs to a tree.
2310
2311 Note that the calling element is pasted:
2312
2313 $child->paste( first_child => $existing_parent);
2314 $new_sibling->paste( after => $this_sibling_is_already_in_the_tree);
2315
2316 or
2317
2318 my $new_elt= XML::Twig::Elt->new( tag => $content);
2319 $new_elt->paste( $position => $existing_elt);
2320
2321 Example:
2322
2323 my $t= XML::Twig->new->parsefile( 'doc.xml');
2324 my $toc= $t->root->new( 'toc');
2325 $toc->paste( $t->root); # $toc is pasted as first child of the root
2326 foreach my $title ($t->findnodes( '/doc/section/title'))
2327 { my $title_toc= $title->copy;
2328 # paste $title_toc as the last child of toc
2329 $title_toc->paste( last_child => $toc)
2330 }
2331
2332 Position options:
2333
2334 first_child (default)
2335 The element is pasted as the first child of $ref
2336
2337 last_child
2338 The element is pasted as the last child of $ref
2339
2340 before
2341 The element is pasted before $ref, as its previous sibling.
2342
2343 after
2344 The element is pasted after $ref, as its next sibling.
2345
2346 within
2347 In this case an extra argument, $offset, should be supplied.
2348 The element will be pasted in the reference element (or in its
2349 first text child) at the given offset. To achieve this the
2350 reference element will be split at the offset.
2351
2352 Note that you can also call the underlying methods directly:
2353
2354 paste_before
2355 paste_after
2356 paste_first_child
2357 paste_last_child
2358 paste_within
2359 move ($optional_position, $ref)
2360 Move an element in the tree. This is just a "cut" then a "paste".
2361 The syntax is the same as "paste".
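
A sketch of reordering children with "move" (the tags a, b and c are placeholders):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><a/><b/><c/></doc>');
my $root = $twig->root;

# Move 'a' to the end: the order becomes b, c, a.
$root->first_child('a')->move( last_child => $root );

# Move 'c' in front of 'b': the order becomes c, b, a.
$root->first_child('c')->move( before => $root->first_child('b') );

my @order = map { $_->tag } $root->children;
```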
2362
2363 replace ($ref)
2364 Replaces an element in the tree. Sometimes it is just not possible
2365 to "cut" an element then "paste" another in its place, so "replace"
2366 comes in handy. The calling element replaces $ref.
2367
2368 replace_with (@elts)
2369 Replaces the calling element with one or more elements
2370
2371 delete
2372 Cuts the element and frees its memory.
2373
2374 prefix ($text, $optional_option)
2375 Add a prefix to an element. If the element is a "PCDATA" element
2376 the text is added to the pcdata, if the element's first child is a
2377 "PCDATA" then the text is added to its pcdata, otherwise a new
2378 "PCDATA" element is created and pasted as the first child of the
2379 element.
2380
2381 If the option is "asis" then the prefix is added asis: it is
2382 created in a separate "PCDATA" element with an "asis" property. You
2383 can then write:
2384
2385 $elt1->prefix( '<b>', 'asis');
2386
2387 to create a "<b>" in the output of "print".
2388
2389 suffix ($text, $optional_option)
2390 Add a suffix to an element. If the element is a "PCDATA" element
2391 the text is added to the pcdata, if the element's last child is a
2392 "PCDATA" then the text is added to its pcdata, otherwise a new
2393 PCDATA element is created and pasted as the last child of the
2394 element.
2395
2396 If the option is "asis" then the suffix is added asis: it is
2397 created in a separate "PCDATA" element with an "asis" property. You
2398 can then write:
2399
2400 $elt2->suffix( '</b>', 'asis');
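
A sketch of "prefix" and "suffix" with and without the "asis" option, on an illustrative paragraph:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p>world</p></doc>');
my $p = $twig->root->first_child('p');

# Plain text is merged into the existing PCDATA child.
$p->prefix('hello ');
$p->suffix('!');
my $text = $p->text;

# With 'asis' the strings are output unescaped, so they act as markup.
$p->prefix( '<b>',  'asis' );
$p->suffix( '</b>', 'asis' );
my $xml = $p->sprint;
```

Note that the "asis" strings are not parsed, so the tree does not contain a "b" element; only the printed output does.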
2401
2402 trim
2403 Trim the element in-place: spaces at the beginning and at the end
2404 of the element are discarded and multiple spaces within the element
2405 (or its descendants) are replaced by a single space.
2406
2407 Note that in some cases you can still end up with multiple spaces,
2408 if they are split between several elements:
2409
2410 <doc> text <b> hah! </b> yep</doc>
2411
2412 gets trimmed to
2413
2414 <doc>text <b> hah! </b> yep</doc>
2415
2416 This is somewhere in between a bug and a feature.
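
A sketch of the simple case, where all the text sits in a single element (the input string is an assumption for the example):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<p>  some   spaced    text  </p>');
my $p = $twig->root;

# Leading and trailing spaces go, inner runs collapse to one space.
$p->trim;

my $text = $p->text;
```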
2417
2418 normalize
2419 merge together all consecutive pcdata elements in the element (if
2420 for example you have turned some elements into pcdata using
2421 "erase", this will give you a "clean" element in which all
2422 text fragments are as long as possible).
2423
2424 simplify (%options)
2425 Return a data structure suspiciously similar to XML::Simple's.
2426 Options are identical to XMLin options, see XML::Simple doc for
2427 more details (or use Data::Dumper or YAML to dump the data
2428 structure)
2429
2430 Note: there is no magic here, if you write "$twig->parsefile( $file
2431 )->simplify();" then it will load the entire document in memory. I
2432 am afraid you will have to put some work into it to get just the
2433 bits you want and discard the rest. Look at the synopsis or the
2434 XML::Twig 101 section at the top of the docs for more information.
2435
2436 content_key
2437 forcearray
2438 keyattr
2439 noattr
2440 normalize_space
2441 aka normalise_space
2442
2443 variables (%var_hash)
2444 %var_hash is a hash { name => value }
2445
2446 This option allows variables in the XML to be expanded when the
2447 file is read. (there is no facility for putting the variable
2448 names back if you regenerate XML using XMLout).
2449
2450 A 'variable' is any text of the form ${name} (or $name) which
2451 occurs in an attribute value or in the text content of an
2452 element. If 'name' matches a key in the supplied hashref,
2453 ${name} will be replaced with the corresponding value from the
2454 hashref. If no matching key is found, the variable will not be
2455 replaced.
2456
2457 var_att ($attribute_name)
2458 This option gives the name of an attribute that will be used to
2459 create variables in the XML:
2460
2461 <dirs>
2462 <dir name="prefix">/usr/local</dir>
2463 <dir name="exec_prefix">$prefix/bin</dir>
2464 </dirs>
2465
2466 use "var => 'name'" to get $prefix replaced by /usr/local in
2467 the generated data structure
2468
2469 By default variables are captured by the following regexp:
2470 /\$(\w+)/
2471
2472 var_regexp (regexp)
2473 This option changes the regexp used to capture variables. The
2474 variable name should be in $1
2475
2476 group_tags { grouping tag => grouped tag, grouping tag 2 => grouped
2477 tag 2...}
2478 Option used to simplify the structure: elements listed will not
2479 be used. Their children will be, they will be considered
2480 children of the element parent.
2481
2482 If the element is:
2483
2484 <config host="laptop.xmltwig.org">
2485 <server>localhost</server>
2486 <dirs>
2487 <dir name="base">/home/mrodrigu/standards</dir>
2488 <dir name="tools">$base/tools</dir>
2489 </dirs>
2490 <templates>
2491 <template name="std_def">std_def.templ</template>
2492 <template name="dummy">dummy</template>
2493 </templates>
2494 </config>
2495
2496 Then calling simplify with "group_tags => { dirs => 'dir',
2497 templates => 'template'}" makes the data structure be exactly
2498 as if the start and end tags for "dirs" and "templates" were
2499 not there.
2500
2501 A YAML dump of the structure
2502
2503 base: '/home/mrodrigu/standards'
2504 host: laptop.xmltwig.org
2505 server: localhost
2506 template:
2507 - std_def.templ
2508 - dummy.templ
2509 tools: '$base/tools'
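
A sketch of a basic "simplify" call without options. The small "config" document is an assumption; the resulting structure is expected to follow XML::Simple conventions, with attributes and text-only children ending up as plain hash values:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(
    '<config host="laptop"><server>localhost</server><port>8080</port></config>'
);

# Turn the tree into a plain Perl data structure (a hash reference here).
my $data = $twig->root->simplify;

my $host   = $data->{host};      # from the attribute
my $server = $data->{server};    # from the child element
```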
2510
2511 split_at ($offset)
2512 Split a text ("PCDATA" or "CDATA") element in 2 at $offset, the
2513 original element now holds the first part of the string and a new
2514 element holds the right part. The new element is returned
2515
2516 If the element is not a text element then the first text child of
2517 the element is split
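
A sketch of splitting a text element directly. The offset of 5 and the sample text are assumptions for the example:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<p>hello world</p>');

# Work on the text child itself.
my $text  = $twig->root->first_child('#PCDATA');
my $right = $text->split_at(5);

# The paragraph now has two text children.
my @parts = map { $_->text } $twig->root->children;
```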
2518
2519 split ( $optional_regexp, $tag1, $atts1, $tag2, $atts2...)
2520 Split the text descendants of an element in place, the text is
2521 split using the $regexp, if the regexp includes () then the matched
2522 separators will be wrapped in elements. $1 is wrapped in $tag1,
2523 with attributes $atts1 if $atts1 is given (as a hashref), $2 is
2524 wrapped in $tag2...
2525
2526 if $elt is "<p>tati tata <b>tutu tati titi</b> tata tati tata</p>"
2527
2528 $elt->split( qr/(ta)ti/, 'foo', {type => 'toto'} )
2529
2530 will change $elt to
2531
2532 <p><foo type="toto">ta</foo> tata <b>tutu <foo type="toto">ta</foo>
2533 titi</b> tata <foo type="toto">ta</foo> tata</p>
2534
The regexp can be passed either as a string or as "qr//" (perl
5.005 and later). It defaults to \s+, just as the "split" built-in
(but this would be quite a useless behaviour without the
$tag1 parameter).

$tag1 defaults to PCDATA or CDATA, depending on the initial
element type.
2542
2543 The list of descendants is returned (including un-touched original
2544 elements and newly created ones)
2545
2546 mark ( $regexp, $optional_tag, $optional_attribute_ref)
2547 This method behaves exactly as split, except only the newly created
2548 elements are returned
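
As an illustration of mark, the sketch below wraps every run of digits in
a "num" element. The document and the "num" tag are assumptions made for
the example:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical paragraph containing two numbers.
my $twig = XML::Twig->new;
$twig->parse('<p>call 555 or 777</p>');
my $p = $twig->root;

# The capturing group is what gets wrapped in <num>.
my @new = $p->mark( qr/(\d+)/, 'num' );

print scalar(@new), " elements created\n";
print $p->sprint, "\n";
```

Calling split with the same arguments would perform the same change but
also return the untouched text elements.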
2549
2550 wrap_children ( $regexp_string, $tag, $optional_attribute_hashref)
2551 Wrap the children of the element that match the regexp in an
2552 element $tag. If $optional_attribute_hashref is passed then the
2553 new element will have these attributes.
2554
2555 The $regexp_string includes tags, within pointy brackets, as in
2556 "<title><para>+" and the usual Perl modifiers (+*?...). Tags can
be further qualified with attributes: "<para type="warning"
classif="cosmic_secret">+". The values for attributes should be
xml-escaped: "<candy type="M&amp;Ms">*" ("<", "&", ">" and """
should be escaped).
2561
2562 Note that elements might get extra "id" attributes in the process.
2563 See add_id. Use strip_att to remove unwanted id's.
2564
2565 Here is an example:
2566
2567 If the element $elt has the following content:
2568
2569 <elt>
2570 <p>para 1</p>
2571 <l_l1_1>list 1 item 1 para 1</l_l1_1>
2572 <l_l1>list 1 item 1 para 2</l_l1>
2573 <l_l1_n>list 1 item 2 para 1 (only para)</l_l1_n>
2574 <l_l1_n>list 1 item 3 para 1</l_l1_n>
2575 <l_l1>list 1 item 3 para 2</l_l1>
2576 <l_l1>list 1 item 3 para 3</l_l1>
2577 <l_l1_1>list 2 item 1 para 1</l_l1_1>
2578 <l_l1>list 2 item 1 para 2</l_l1>
2579 <l_l1_n>list 2 item 2 para 1 (only para)</l_l1_n>
2580 <l_l1_n>list 2 item 3 para 1</l_l1_n>
2581 <l_l1>list 2 item 3 para 2</l_l1>
2582 <l_l1>list 2 item 3 para 3</l_l1>
2583 </elt>
2584
2585 Then the code
2586
2587 $elt->wrap_children( q{<l_l1_1><l_l1>*} , li => { type => "ul1" });
2588 $elt->wrap_children( q{<l_l1_n><l_l1>*} , li => { type => "ul" });
2589
2590 $elt->wrap_children( q{<li type="ul1"><li type="ul">+}, "ul");
2591 $elt->strip_att( 'id');
2592 $elt->strip_att( 'type');
2593 $elt->print;
2594
2595 will output:
2596
2597 <elt>
2598 <p>para 1</p>
2599 <ul>
2600 <li>
2601 <l_l1_1>list 1 item 1 para 1</l_l1_1>
2602 <l_l1>list 1 item 1 para 2</l_l1>
2603 </li>
2604 <li>
2605 <l_l1_n>list 1 item 2 para 1 (only para)</l_l1_n>
2606 </li>
2607 <li>
2608 <l_l1_n>list 1 item 3 para 1</l_l1_n>
2609 <l_l1>list 1 item 3 para 2</l_l1>
2610 <l_l1>list 1 item 3 para 3</l_l1>
2611 </li>
2612 </ul>
2613 <ul>
2614 <li>
2615 <l_l1_1>list 2 item 1 para 1</l_l1_1>
2616 <l_l1>list 2 item 1 para 2</l_l1>
2617 </li>
2618 <li>
2619 <l_l1_n>list 2 item 2 para 1 (only para)</l_l1_n>
2620 </li>
2621 <li>
2622 <l_l1_n>list 2 item 3 para 1</l_l1_n>
2623 <l_l1>list 2 item 3 para 2</l_l1>
2624 <l_l1>list 2 item 3 para 3</l_l1>
2625 </li>
2626 </ul>
2627 </elt>
2628
2629 subs_text ($regexp, $replace)
2630 subs_text does text substitution, similar to perl's " s///"
2631 operator.
2632
2633 $regexp must be a perl regexp, created with the "qr" operator.
2634
$replace can include "$1, $2"... from the $regexp. It can also be
used to create elements and entities, by using "&elt( tag => { att
=> val }, text)" (similar syntax as "new") and "&ent( name)".
2638
2639 Here is a rather complex example:
2640
2641 $elt->subs_text( qr{(?<!do not )link to (http://([^\s,]*))},
2642 'see &elt( a =>{ href => $1 }, $2)'
2643 );
2644
This will replace text like "link to http://www.xmltwig.org" by "see
<a href="http://www.xmltwig.org">www.xmltwig.org</a>", but not "do not
link to...".
2648
Generating entities (here replacing spaces with &nbsp;):

  $elt->subs_text( qr{ }, '&ent( "&nbsp;")');

or, using a variable:

  my $ent="&nbsp;";
  $elt->subs_text( qr{ }, "&ent( '$ent')");
2657
2658 Note that the substitution is always global, as in using the "g"
2659 modifier in a perl substitution, and that it is performed on all
2660 text descendants of the element.
2661
2662 Bug: in the $regexp, you can only use "\1", "\2"... if the
2663 replacement expression does not include elements or attributes. eg
2664
2665 $t->subs_text( qr/((t[aiou])\2)/, '$2'); # ok, replaces toto, tata, titi, tutu by to, ta, ti, tu
2666 $t->subs_text( qr/((t[aiou])\2)/, '&elt(p => $1)' ); # NOK, does not find toto...
2667
2668 add_id ($optional_coderef)
2669 Add an id to the element.
2670
2671 The id is an attribute, "id" by default, see the "id" option for
2672 XML::Twig "new" to change it. Use an id starting with "#" to get an
2673 id that's not output by print, flush or sprint, yet that allows you
2674 to use the elt_id method to get the element easily.
2675
2676 If the element already has an id, no new id is generated.
2677
By default the method creates an id of the form "twig_id_<nnnn>",
where "<nnnn>" is a number, incremented each time the method is
called successfully.
2681
2682 set_id_seed ($prefix)
By default the id generated by "add_id" is "twig_id_<nnnn>";
"set_id_seed" changes the prefix to $prefix and resets the number
to 1.
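
A short sketch of how add_id, set_id_seed and elt_id can work together.
The document and the "sect_" prefix are assumptions chosen for the
example:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document with two sections and no ids.
my $twig = XML::Twig->new;
$twig->parse('<doc><sect/><sect/></doc>');
my ( $s1, $s2 ) = $twig->root->children('sect');

$s1->add_id;                  # uses the default prefix
$s1->set_id_seed('sect_');    # ids generated from now on use this prefix
$s2->add_id;

print $s1->id, " ", $s2->id, "\n";

# The generated id can be used to find the element again.
my $found = $twig->elt_id( $s2->id );
```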
2686
2687 strip_att ($att)
2688 Remove the attribute $att from all descendants of the element
2689 (including the element)
2690
2691 Return the element
2692
2693 change_att_name ($old_name, $new_name)
2694 Change the name of the attribute from $old_name to $new_name. If
2695 there is no attribute $old_name nothing happens.
2696
2697 lc_attnames
Lowercases the names of all the attributes of the element.
2699
2700 sort_children_on_value( %options)
2701 Sort the children of the element in place according to their text.
2702 All children are sorted.
2703
2704 Return the element, with its children sorted.
2705
2706 %options are
2707
2708 type : numeric | alpha (default: alpha)
2709 order : normal | reverse (default: normal)
2712
2713 sort_children_on_att ($att, %options)
2714 Sort the children of the element in place according to attribute
2715 $att. %options are the same as for "sort_children_on_value"
2716
2717 Return the element.
2718
2719 sort_children_on_field ($tag, %options)
2720 Sort the children of the element in place, according to the field
2721 $tag (the text of the first child of the child with this tag).
2722 %options are the same as for "sort_children_on_value".
2723
2724 Return the element, with its children sorted
2725
2726 sort_children( $get_key, %options)
2727 Sort the children of the element in place. The $get_key argument is
2728 a reference to a function that returns the sort key when passed an
2729 element.
2730
2731 For example:
2732
2733 $elt->sort_children( sub { $_[0]->{'att'}->{"nb"} + $_[0]->text },
2734 type => 'numeric', order => 'reverse'
2735 );
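
The sorting methods can be compared side by side on a small list. This is
a sketch only: the "list" and "item" tags and the "n" attribute are
invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical list, with a numeric attribute on each item.
my $twig = XML::Twig->new;
$twig->parse(
    '<list><item n="10">pear</item><item n="9">apple</item><item n="100">fig</item></list>'
);
my $list = $twig->root;

# Alphabetical sort on the text of the children.
$list->sort_children_on_value;
my @by_text = map { $_->text } $list->children;

# Numeric sort on the n attribute.
$list->sort_children_on_att( 'n', type => 'numeric' );
my @by_att = map { $_->att('n') } $list->children;

# Custom key (text length), longest first.
$list->sort_children( sub { length $_[0]->text },
                      type => 'numeric', order => 'reverse' );
my @by_len = map { $_->text } $list->children;

print "@by_text\n@by_att\n@by_len\n";
```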
2736
2737 field_to_att ($cond, $att)
2738 Turn the text of the first sub-element matched by $cond into the
2739 value of attribute $att of the element. If $att is omitted then
2740 $cond is used as the name of the attribute, which makes sense only
2741 if $cond is a valid element (and attribute) name.
2742
2743 The sub-element is then cut.
2744
2745 att_to_field ($att, $tag)
2746 Take the value of attribute $att and create a sub-element $tag as
2747 first child of the element. If $tag is omitted then $att is used as
2748 the name of the sub-element.
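
A sketch of a round trip between a field and an attribute, assuming a
made-up "book" record. The "name" tag passed to att_to_field is also an
assumption:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical record with two fields.
my $twig = XML::Twig->new;
$twig->parse('<book><title>Perl</title><author>Wall</author></book>');
my $book = $twig->root;

# The title sub-element becomes a title attribute.
$book->field_to_att('title');
my $title_att = $book->att('title');
print $book->sprint, "\n";

# The attribute value comes back as a first child, under a new tag.
$book->att_to_field( 'title', 'name' );
print $book->sprint, "\n";
```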
2749
2750 get_xpath ($xpath, $optional_offset)
2751 Return a list of elements satisfying the $xpath. $xpath is an
2752 XPATH-like expression.
2753
2754 A subset of the XPATH abbreviated syntax is covered:
2755
2756 tag
2757 tag[1] (or any other positive number)
2758 tag[last()]
2759 tag[@att] (the attribute exists for the element)
2760 tag[@att="val"]
2761 tag[@att=~ /regexp/]
tag[@att1="val1" and @att2="val2"]
tag[@att1="val1" or @att2="val2"]
tag[string()="toto"] (returns tag elements whose text (as per the text method)
is toto)
tag[string()=~/regexp/] (returns tag elements whose text (as per the text
method) matches regexp)
2768 expressions can start with / (search starts at the document root)
2769 expressions can start with . (search starts at the current element)
2770 // can be used to get all descendants instead of just direct children
2771 * matches any tag
2772
So the following examples from the XPath
recommendation <http://www.w3.org/TR/xpath.html#path-abbrev> work:
2775
2776 para selects the para element children of the context node
2777 * selects all element children of the context node
2778 para[1] selects the first para child of the context node
2779 para[last()] selects the last para child of the context node
2780 */para selects all para grandchildren of the context node
2781 /doc/chapter[5]/section[2] selects the second section of the fifth chapter
2782 of the doc
2783 chapter//para selects the para element descendants of the chapter element
2784 children of the context node
2785 //para selects all the para descendants of the document root and thus selects
2786 all para elements in the same document as the context node
2787 //olist/item selects all the item elements in the same document as the
2788 context node that have an olist parent
2789 .//para selects the para element descendants of the context node
2790 .. selects the parent of the context node
2791 para[@type="warning"] selects all para children of the context node that have
2792 a type attribute with value warning
2793 employee[@secretary and @assistant] selects all the employee children of the
2794 context node that have both a secretary attribute and an assistant
2795 attribute
2796
2797 The elements will be returned in the document order.
2798
2799 If $optional_offset is used then only one element will be returned,
2800 the one with the appropriate offset in the list, starting at 0
2801
2802 Quoting and interpolating variables can be a pain when the Perl
2803 syntax and the XPATH syntax collide, so use alternate quoting
2804 mechanisms like q or qq (I like q{} and qq{} myself).
2805
2806 Here are some more examples to get you started:
2807
  my $p1= "p1";
  my $p2= "p2";
  my @res= $t->get_xpath( qq{p[string()="$p1" or string()="$p2"]});

  my $att_val= "a1";
  @res= $t->get_xpath( qq{//*[\@att="$att_val"]}); # \@ stops qq{} from interpolating @att

  my $val= "a1";
  my $exp= qq{//p[ \@att='$val']}; # you need to use \@ or you will get a warning
  @res= $t->get_xpath( $exp);
2818
Note that the only supported regexp delimiter is /, and that you
must backslash all / in regexps AND in regular strings.
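
The following self-contained sketch runs a few of the expressions
described above against a small document. The document, tag names and
attribute values are assumptions made for the example:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document with two sections.
my $twig = XML::Twig->new;
$twig->parse(
    '<doc><sect><p type="warning">a</p><p>b</p></sect><sect><p>c</p></sect></doc>'
);
my $root = $twig->root;

my @all      = $root->get_xpath('//p');                    # every p in the document
my @warnings = $root->get_xpath('//p[@type="warning"]');   # attribute test
my @second   = $root->get_xpath('/doc/sect[2]/p');         # positional step

# With an offset a single element is returned; offsets start at 0,
# so 1 is the second match.
my $second_p = $root->get_xpath( '//p', 1 );

# qq{} is handy for interpolation, but @ must then be escaped.
my $type  = 'warning';
my @typed = $root->get_xpath( qq{//p[\@type="$type"]} );

print scalar(@all), " p elements, second one is '", $second_p->text, "'\n";
```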
2821
2822 XML::Twig does not provide natively full XPATH support, but you can
2823 use "XML::Twig::XPath" to get "findnodes" to use "XML::XPath" as
2824 the XPath engine, with full coverage of the spec.
2828
2829 find_nodes
same as "get_xpath"
2831
2832 findnodes
2833 same as "get_xpath"
2834
2835 text @optional_options
2836 Return a string consisting of all the "PCDATA" and "CDATA" in an
2837 element, without any tags. The text is not XML-escaped: base
2838 entities such as "&" and "<" are not escaped.
2839
2840 The '"no_recurse"' option will only return the text of the element,
2841 not of any included sub-elements (same as "text_only").
2842
2843 text_only
2844 Same as "text" except that the text returned doesn't include the
2845 text of sub-elements.
2846
2847 trimmed_text
2848 Same as "text" except that the text is trimmed: leading and
2849 trailing spaces are discarded, consecutive spaces are collapsed
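
The three text methods differ only in what they include and how they
treat spaces. A small sketch, using a made-up paragraph:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical paragraph with extra spaces and an inline element.
my $twig = XML::Twig->new;
$twig->parse('<p>  Hello <b>big</b>   world </p>');
my $p = $twig->root;

my $text      = $p->text;             # all text, spaces kept
my $text_only = $p->text_only;        # text of p itself, not of <b>
my $trimmed   = $p->trimmed_text;     # all text, spaces normalized
my $no_rec    = $p->text('no_recurse');

print "[$text]\n[$text_only]\n[$trimmed]\n";
```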
2850
2851 set_text ($string)
2852 Set the text for the element: if the element is a "PCDATA", just
2853 set its text, otherwise cut all the children of the element and
2854 create a single "PCDATA" child for it, which holds the text.
2855
2856 merge ($elt2)
2857 Move the content of $elt2 within the element
2858
2859 insert ($tag1, [$optional_atts1], $tag2, [$optional_atts2],...)
For each tag in the list, inserts an element $tag as the only child
of the element. The element gets the optional attributes
in "$optional_atts<n>". All children of the element are set as
children of the new element. The upper level element is returned.
2864
2865 $p->insert( table => { border=> 1}, 'tr', 'td')
2866
puts the content of $p in a table with a visible border, a single
"tr" and a single "td", and returns the "table" element:
2869
2870 <p><table border="1"><tr><td>original content of p</td></tr></table></p>
2871
2872 wrap_in (@tag)
2873 Wrap elements in @tag as the successive ancestors of the element,
2874 returns the new element. "$elt->wrap_in( 'td', 'tr', 'table')"
2875 wraps the element as a single cell in a table for example.
2876
2877 Optionally each tag can be followed by a hashref of attributes,
2878 that will be set on the wrapping element:
2879
2880 $elt->wrap_in( p => { class => "advisory" }, div => { class => "intro", id => "div_intro" });
2881
2882 insert_new_elt ($opt_position, $tag, $opt_atts_hashref, @opt_content)
Combines a "new" and a "paste": creates a new element using $tag,
$opt_atts_hashref and @opt_content, which are arguments similar to
those for "new", then pastes it, using $opt_position or
'first_child', relative to $elt.
2887
2888 Return the newly created element
2889
2890 erase
2891 Erase the element: the element is deleted and all of its children
2892 are pasted in its place.
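
A sketch combining insert_new_elt and erase on a made-up paragraph. The
"i" element and its "class" attribute are assumptions for the example:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical paragraph with one inline element.
my $twig = XML::Twig->new;
$twig->parse('<p>one <b>two</b> three</p>');
my $p = $twig->root;

# Create a new element and paste it as the last child of $p.
my $note = $p->insert_new_elt( last_child => 'i', { class => 'note' }, '!' );
print $p->sprint, "\n";

# Remove the <b> tags but keep their content in place.
$p->first_child('b')->erase;
print $p->sprint, "\n";
```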
2893
2894 set_content ( $optional_atts, @list_of_elt_and_strings) (
2895 $optional_atts, '#EMPTY')
2896 Set the content for the element, from a list of strings and
2897 elements. Cuts all the element children, then pastes the list
2898 elements as the children. This method will create a "PCDATA"
2899 element for any strings in the list.
2900
2901 The $optional_atts argument is the ref of a hash of attributes. If
2902 this argument is used then the previous attributes are deleted,
2903 otherwise they are left untouched.
2904
2905 WARNING: if you rely on ID's then you will have to set the id
2906 yourself. At this point the element does not belong to a twig yet,
2907 so the ID attribute is not known so it won't be stored in the ID
2908 list.
2909
2910 A content of '"#EMPTY"' creates an empty element;
2911
2912 namespace ($optional_prefix)
2913 Return the URI of the namespace that $optional_prefix or the
2914 element name belongs to. If the name doesn't belong to any
2915 namespace, "undef" is returned.
2916
2917 local_name
2918 Return the local name (without the prefix) for the element
2919
2920 ns_prefix
2921 Return the namespace prefix for the element
2922
2923 current_ns_prefixes
2924 Return a list of namespace prefixes valid for the element. The
2925 order of the prefixes in the list has no meaning. If the default
2926 namespace is currently bound, '' appears in the list.
2927
2928 inherit_att ($att, @optional_tag_list)
2929 Return the value of an attribute inherited from parent tags. The
2930 value returned is found by looking for the attribute in the element
2931 then in turn in each of its ancestors. If the @optional_tag_list is
2932 supplied only those ancestors whose tag is in the list will be
2933 checked.
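
For instance, a "lang" attribute set on an ancestor can be read from any
descendant. This is a sketch only; the document and the attribute name
are invented:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document: lang is set on the root and overridden once.
my $twig = XML::Twig->new;
$twig->parse(
    '<doc lang="en"><sect><p>hi</p></sect><sect lang="fr"><p>salut</p></sect></doc>'
);
my ( $p_en, $p_fr ) = $twig->root->descendants('p');

# The closest ancestor carrying the attribute wins.
my $lang_en = $p_en->inherit_att('lang');
my $lang_fr = $p_fr->inherit_att('lang');

print "$lang_en $lang_fr\n";
```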
2934
2935 all_children_are ($optional_condition)
2936 return 1 if all children of the element pass the
2937 $optional_condition, 0 otherwise
2938
2939 level ($optional_condition)
2940 Return the depth of the element in the twig (root is 0). If
2941 $optional_condition is given then only ancestors that match the
2942 condition are counted.
2943
2944 WARNING: in a tree created using the "twig_roots" option this will
2945 not return the level in the document tree, level 0 will be the
2946 document root, level 1 will be the "twig_roots" elements. During
2947 the parsing (in a "twig_handler") you can use the "depth" method on
2948 the twig object to get the real parsing depth.
2949
2950 in ($potential_parent)
2951 Return true if the element is in the potential_parent
2952 ($potential_parent is an element)
2953
2954 in_context ($cond, $optional_level)
2955 Return true if the element is included in an element which passes
2956 $cond optionally within $optional_level levels. The returned value
2957 is the including element.
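
A short sketch of level, in and in_context, assuming a made-up
three-level document:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document: doc > sect > p.
my $twig = XML::Twig->new;
$twig->parse('<doc><sect><p>hi</p></sect></doc>');
my $p    = $twig->root->first_descendant('p');
my $sect = $p->parent;

my $depth      = $p->level;           # root is 0
my $sect_depth = $p->level('sect');   # only sect ancestors are counted
my $inside     = $p->in($sect);
my $container  = $p->in_context('doc');       # the doc element
my $too_far    = $p->in_context( 'doc', 1 );  # doc is 2 levels up, so false

print "depth $depth, inside a ", $container->tag, "\n";
```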
2958
2959 pcdata
2960 Return the text of a "PCDATA" element or "undef" if the element is
2961 not "PCDATA".
2962
2963 pcdata_xml_string
2964 Return the text of a "PCDATA" element or undef if the element is
not "PCDATA". The text is "XML-escaped" ('&' and '<' are replaced
by '&amp;' and '&lt;').
2967
2968 set_pcdata ($text)
2969 Set the text of a "PCDATA" element. This method does not check that
2970 the element is indeed a "PCDATA" so usually you should use
2971 "set_text" instead.
2972
2973 append_pcdata ($text)
2974 Add the text at the end of a "PCDATA" element.
2975
2976 is_cdata
2977 Return 1 if the element is a "CDATA" element, returns 0 otherwise.
2978
2979 is_text
2980 Return 1 if the element is a "CDATA" or "PCDATA" element, returns 0
2981 otherwise.
2982
2983 cdata
2984 Return the text of a "CDATA" element or "undef" if the element is
2985 not "CDATA".
2986
2987 cdata_string
2988 Return the XML string of a "CDATA" element, including the opening
2989 and closing markers.
2990
2991 set_cdata ($text)
2992 Set the text of a "CDATA" element.
2993
2994 append_cdata ($text)
2995 Add the text at the end of a "CDATA" element.
2996
2997 remove_cdata
2998 Turns all "CDATA" sections in the element into regular "PCDATA"
2999 elements. This is useful when converting XML to HTML, as browsers
3000 do not support CDATA sections.
3001
3002 extra_data
3003 Return the extra_data (comments and PI's) attached to an element
3004
3005 set_extra_data ($extra_data)
3006 Set the extra_data (comments and PI's) attached to an element
3007
3008 append_extra_data ($extra_data)
3009 Append extra_data to the existing extra_data before the element (if
3010 no previous extra_data exists then it is created)
3011
3012 set_asis
3013 Set a property of the element that causes it to be output without
being XML escaped by the print functions: if it contains "a < b" it
will be output as such and not as "a &lt; b". This can be useful to
3016 create text elements that will be output as markup. Note that all
3017 "PCDATA" descendants of the element are also marked as having the
3018 property (they are the ones that are actually impacted by the
3019 change).
3020
3021 If the element is a "CDATA" element it will also be output asis,
3022 without the "CDATA" markers. The same goes for any "CDATA"
3023 descendant of the element
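
The effect of set_asis can be seen by printing the same element before
and after the call. A minimal sketch, with an element built by hand for
the example:

```perl
use strict;
use warnings;
use XML::Twig;

# Stand-alone element whose text contains a character that needs escaping.
my $elt = XML::Twig::Elt->new( p => 'a < b' );

my $escaped = $elt->sprint;    # text is XML-escaped
$elt->set_asis;
my $raw = $elt->sprint;        # text is output as it is stored

print "$escaped\n$raw\n";
```

Output produced after set_asis is not guaranteed to be well-formed XML,
so the property is best reserved for text that is known to be markup.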
3024
3025 set_not_asis
3026 Unsets the "asis" property for the element and its text
3027 descendants.
3028
3029 is_asis
3030 Return the "asis" property status of the element ( 1 or "undef")
3031
3032 closed
3033 Return true if the element has been closed. Might be useful if you
3034 are somewhere in the tree, during the parse, and have no idea
3035 whether a parent element is completely loaded or not.
3036
3037 get_type
3038 Return the type of the element: '"#ELT"' for "real" elements, or
3039 '"#PCDATA"', '"#CDATA"', '"#COMMENT"', '"#ENT"', '"#PI"'
3040
3041 is_elt
3042 Return the tag if the element is a "real" element, or 0 if it is
3043 "PCDATA", "CDATA"...
3044
3045 contains_only_text
3046 Return 1 if the element does not contain any other "real" element
3047
3048 contains_only ($exp)
3049 Return the list of children if all children of the element match
3050 the expression $exp
3051
3052 if( $para->contains_only( 'tt')) { ... }
3053
3054 contains_a_single ($exp)
3055 If the element contains a single child that matches the expression
3056 $exp returns that element. Otherwise returns 0.
3057
3058 is_field
3059 same as "contains_only_text"
3060
3061 is_pcdata
3062 Return 1 if the element is a "PCDATA" element, returns 0 otherwise.
3063
3064 is_ent
3065 Return 1 if the element is an entity (an unexpanded entity)
3066 element, return 0 otherwise.
3067
3068 is_empty
3069 Return 1 if the element is empty, 0 otherwise
3070
3071 set_empty
Flags the element as empty. No further check is made, so if the
element is actually not empty the output will be messed up. The only
effect of this method is that the output will be "<tag
att="value"/>".

set_not_empty
Flags the element as not empty. If it is actually empty then the
element will be output as "<tag att="value"></tag>".
3080
3081 is_pi
3082 Return 1 if the element is a processing instruction ("#PI")
3083 element, return 0 otherwise.
3084
3085 target
3086 Return the target of a processing instruction
3087
3088 set_target ($target)
3089 Set the target of a processing instruction
3090
3091 data
3092 Return the data part of a processing instruction
3093
3094 set_data ($data)
3095 Set the data of a processing instruction
3096
3097 set_pi ($target, $data)
3098 Set the target and data of a processing instruction
3099
3100 pi_string
3101 Return the string form of a processing instruction ("<?target
3102 data?>")
3103
3104 is_comment
3105 Return 1 if the element is a comment ("#COMMENT") element, return 0
3106 otherwise.
3107
3108 set_comment ($comment_text)
3109 Set the text for a comment
3110
3111 comment
3112 Return the content of a comment (just the text, not the "<!--" and
3113 "-->")
3114
3115 comment_string
3116 Return the XML string for a comment ("<!-- comment -->")
3117
3118 Note that an XML comment cannot start or end with a '-', or include
3119 '--' (http://www.w3.org/TR/2008/REC-xml-20081126/#sec-comments), if
3120 that is the case (because you have created the comment yourself
3121 presumably, as it could not be in the input XML), then a space will
3122 be inserted before an initial '-', after a trailing one or between
3123 two '-' in the comment (which could presumably mangle javascript
3124 "hidden" in an XHTML comment);
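
Comments and processing instructions are only available as elements when
the twig is created with the "comments" and "pis" options set to
'process'. The sketch below assumes that setup and a made-up document:

```perl
use strict;
use warnings;
use XML::Twig;

# Ask the twig to build elements for comments and PIs.
my $twig = XML::Twig->new( comments => 'process', pis => 'process' );
$twig->parse('<doc><!-- note --><?app do it?><p>text</p></doc>');
my $root = $twig->root;

# Code references are used as conditions to find the two elements.
my $comment = $root->first_child( sub { $_[0]->is_comment } );
my $pi      = $root->first_child( sub { $_[0]->is_pi } );

print "comment: ", $comment->comment,        "\n";
print "target:  ", $pi->target,              "\n";
print "data:    ", $pi->data,                "\n";
print "strings: ", $comment->comment_string, " ", $pi->pi_string, "\n";
```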
3125
3126 set_ent ($entity)
Set a (non-expanded) entity ("#ENT"). $entity is the entity text
("&ent;").
3129
3130 ent Return the entity for an entity ("#ENT") element ("&ent;")
3131
3132 ent_name
3133 Return the entity name for an entity ("#ENT") element ("ent")
3134
3135 ent_string
3136 Return the entity, either expanded if the expanded version is
3137 available, or non-expanded ("&ent;") otherwise
3138
3139 child ($offset, $optional_condition)
3140 Return the $offset-th child of the element, optionally the
3141 $offset-th child that matches $optional_condition. The children are
3142 treated as a list, so "$elt->child( 0)" is the first child, while
3143 "$elt->child( -1)" is the last child.
3144
3145 child_text ($offset, $optional_condition)
Return the text of a child or "undef" if the child does not
exist. Arguments are the same as "child".
3148
3149 last_child ($optional_condition)
3150 Return the last child of the element, or the last child matching
3151 $optional_condition (ie the last of the element children matching
3152 the condition).
3153
3154 last_child_text ($optional_condition)
3155 Same as "first_child_text" but for the last child.
3156
3157 sibling ($offset, $optional_condition)
3158 Return the next or previous $offset-th sibling of the element, or
3159 the $offset-th one matching $optional_condition. If $offset is
3160 negative then a previous sibling is returned, if $offset is
positive then a next sibling is returned. "$offset=0" returns the
element if there is no condition or if the element matches the
condition, "undef" otherwise.
3164
3165 sibling_text ($offset, $optional_condition)
3166 Return the text of a sibling or "undef" if the sibling does not
3167 exist. Arguments are the same as "sibling".
3168
3169 prev_siblings ($optional_condition)
3170 Return the list of previous siblings (optionally matching
3171 $optional_condition) for the element. The elements are ordered in
3172 document order.
3173
3174 next_siblings ($optional_condition)
3175 Return the list of siblings (optionally matching
3176 $optional_condition) following the element. The elements are
3177 ordered in document order.
3178
3179 siblings ($optional_condition)
3180 Return the list of siblings (optionally matching
3181 $optional_condition) of the element (excluding the element itself).
3182 The elements are ordered in document order.
3183
3184 pos ($optional_condition)
3185 Return the position of the element in the children list. The first
3186 child has a position of 1 (as in XPath).
3187
3188 If the $optional_condition is given then only siblings that match
3189 the condition are counted. If the element itself does not match the
3190 condition then 0 is returned.
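
The offsets used by child and sibling, and the positions returned by pos,
are easy to mix up. This sketch shows them on a made-up list in which one
"note" element sits between the items:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical list: three items and one note.
my $twig = XML::Twig->new;
$twig->parse('<list><item>a</item><note>n</note><item>b</item><item>c</item></list>');
my $list = $twig->root;

my $first  = $list->child(0);              # offsets start at 0
my $last   = $list->child(-1);             # negative offsets count from the end
my $item_b = $list->child( 1, 'item' );    # second child matching the condition

my $next_any  = $first->sibling(1);            # the note
my $next_item = $first->sibling( 1, 'item' );  # skips the note

my $pos_all  = $last->pos;            # positions start at 1
my $pos_item = $last->pos('item');    # only item siblings are counted

print $last->text, " is child $pos_all, item $pos_item\n";
```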
3191
3192 atts
3193 Return a hash ref containing the element attributes
3194
3195 set_atts ({ att1=>$att1_val, att2=> $att2_val... })
3196 Set the element attributes with the hash ref supplied as the
3197 argument. The previous attributes are lost (ie the attributes set
3198 by "set_atts" replace all of the attributes of the element).
3199
3200 You can also pass a list instead of a hashref: "$elt->set_atts(
3201 att1 => 'val1',...)"
3202
3203 del_atts
3204 Deletes all the element attributes.
3205
3206 att_nb
3207 Return the number of attributes for the element
3208
3209 has_atts
Return true if the element has attributes (in fact return the
number of attributes, thus being an alias to "att_nb").
3212
3213 has_no_atts
3214 Return true if the element has no attributes, false (0) otherwise
3215
3216 att_names
3217 return a list of the attribute names for the element
3218
3219 att_xml_string ($att, $options)
3220 Return the attribute value, where '&', '<' and quote (" or the
3221 value of the quote option at twig creation) are XML-escaped.
3222
The options are passed as a hashref, setting "escape_gt" to a true
value will also escape '>': "$elt->att_xml_string( 'myatt', { escape_gt => 1 })".
3225
3226 set_id ($id)
Set the "id" attribute of the element to the value. See the "id"
option of XML::Twig "new" to change the id attribute name.
3229
3230 id Gets the id attribute value
3231
3232 del_id ($id)
3233 Deletes the "id" attribute of the element and remove it from the id
3234 list for the document
3235
3236 class
3237 Return the "class" attribute for the element (methods on the
3238 "class" attribute are quite convenient when dealing with XHTML, or
3239 plain XML that will eventually be displayed using CSS)
3240
3241 lclass
3242 same as class, except that this method is an lvalue, so you can do
3243 "$elt->lclass= "foo""
3244
3245 set_class ($class)
3246 Set the "class" attribute for the element to $class
3247
3248 add_class ($class)
3249 Add $class to the element "class" attribute: the new class is added
3250 only if it is not already present.
3251
3252 Note that classes are then sorted alphabetically, so the "class"
3253 attribute can be changed even if the class is already there
3254
3255 remove_class ($class)
3256 Remove $class from the element "class" attribute.
3257
Note that the remaining classes are then sorted alphabetically, so the
"class" attribute can be changed even if $class was not there.
3260
3261 add_to_class ($class)
3262 alias for add_class
3263
3264 att_to_class ($att)
3265 Set the "class" attribute to the value of attribute $att
3266
3267 add_att_to_class ($att)
3268 Add the value of attribute $att to the "class" attribute of the
3269 element
3270
3271 move_att_to_class ($att)
3272 Add the value of attribute $att to the "class" attribute of the
3273 element and delete the attribute
3274
3275 tag_to_class
3276 Set the "class" attribute of the element to the element tag
3277
3278 add_tag_to_class
3279 Add the element tag to its "class" attribute
3280
3281 set_tag_class ($new_tag)
3282 Add the element tag to its "class" attribute and sets the tag to
3283 $new_tag
3284
3285 in_class ($class)
3286 Return true (1) if the element is in the class $class (if $class is
3287 one of the tokens in the element "class" attribute)
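
A sketch of the class methods on a stand-alone element. The tag and class
names are assumptions chosen for the example:

```perl
use strict;
use warnings;
use XML::Twig;

# Stand-alone element, built by hand for the example.
my $elt = XML::Twig::Elt->new( para => 'text' );

$elt->set_class('note');
$elt->add_class('important');
my $classes = $elt->class;       # classes are kept sorted

$elt->remove_class('note');
$elt->set_tag_class('p');        # old tag goes to the class, tag becomes p

print $elt->sprint, "\n";
```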
3288
3289 tag_to_span
Change the element tag to "span" and set its class to the old tag

tag_to_div
Change the element tag to "div" and set its class to the old tag
3294
3295 DESTROY
3296 Frees the element from memory.
3297
3298 start_tag
3299 Return the string for the start tag for the element, including the
3300 "/>" at the end of an empty element tag
3301
3302 end_tag
3303 Return the string for the end tag of an element. For an empty
3304 element, this returns the empty string ('').
3305
3306 xml_string @optional_options
3307 Equivalent to "$elt->sprint( 1)", returns the string for the entire
3308 element, excluding the element's tags (but nested element tags are
3309 present)
3310
3311 The '"no_recurse"' option will only return the text of the element,
3312 not of any included sub-elements (same as "xml_text_only").
3313
3314 inner_xml
3315 Another synonym for xml_string
3316
3317 outer_xml
Another synonym for sprint
3319
3320 xml_text
Return the text of the element, encoded (and processed by the
current "output_filter" or "output_encoding" options), without any
tag.
3324
3325 xml_text_only
3326 Same as "xml_text" except that the text returned doesn't include
3327 the text of sub-elements.
3328
3329 set_pretty_print ($style)
3330 Set the pretty print method, amongst '"none"' (default),
3331 '"nsgmls"', '"nice"', '"indented"', '"record"' and '"record_c"'
3332
3333 pretty_print styles:
3334
3335 none
3336 the default, no "\n" is used
3337
3338 nsgmls
3339 nsgmls style, with "\n" added within tags
3340
3341 nice
3342 adds "\n" wherever possible (NOT SAFE, can lead to invalid XML)
3343
3344 indented
3345 same as "nice" plus indents elements (NOT SAFE, can lead to
3346 invalid XML)
3347
3348 record
3349 table-oriented pretty print, one field per line
3350
3351 record_c
3352 table-oriented pretty print, more compact than "record", one
3353 record per line
3354
3355 set_empty_tag_style ($style)
3356 Set the method to output empty tags, amongst '"normal"' (default),
3357 '"html"', and '"expand"',
3358
3359 "normal" outputs an empty tag '"<tag/>"', "html" adds a space
3360 '"<tag />"' for elements that can be empty in XHTML and "expand"
3361 outputs '"<tag></tag>"'
3362
3363 set_remove_cdata ($flag)
3364 set (or unset) the flag that forces the twig to output CDATA
3365 sections as regular (escaped) PCDATA
3366
3367 set_indent ($string)
3368 Set the indentation for the indented pretty print style (default is
3369 2 spaces)
3370
3371 set_quote ($quote)
Set the quotes used for attributes. Can be '"double"' (default) or
'"single"'.
3374
3375 cmp ($elt)
3376 Compare the order of the 2 elements in a twig.
3377
$a is the <A>...</A> element, $b is the <B>...</B> element
3379
3380 document $a->cmp( $b)
3381 <A> ... </A> ... <B> ... </B> -1
3382 <A> ... <B> ... </B> ... </A> -1
3383 <B> ... </B> ... <A> ... </A> 1
3384 <B> ... <A> ... </A> ... </B> 1
3385 $a == $b 0
3386 $a and $b not in the same tree undef
3387
3388 before ($elt)
Return 1 if the element starts before $elt, 0 otherwise. If the 2
elements are not in the same twig then return "undef".

  if( $a->cmp( $b) == -1) { return 1; } else { return 0; }

after ($elt)
Return 1 if the element starts after $elt, 0 otherwise. If the 2
elements are not in the same twig then return "undef".

  if( $a->cmp( $b) == 1) { return 1; } else { return 0; }
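
The cmp table above can be checked on a small document. This is a sketch
with made-up tags; the variables are not called $a and $b because those
names are reserved for Perl's sort:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical document: <a> contains <c> and is followed by <b>.
my $twig = XML::Twig->new;
$twig->parse('<doc><a><c/></a><b/></doc>');
my $root   = $twig->root;
my $first  = $root->first_child('a');
my $inner  = $first->first_child('c');
my $second = $root->first_child('b');

my $forward   = $first->cmp($second);   # a starts before b
my $backward  = $second->cmp($first);   # b starts after a
my $container = $first->cmp($inner);    # a starts before its own child
my $same      = $first->cmp($first);

print "forward=$forward backward=$backward container=$container same=$same\n";

# before and after are thin wrappers around cmp.
print "a is before b\n" if $first->before($second);
```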
3399
3400 other comparison methods
3401 lt
3402 le
3403 gt
3404 ge
3405 path
3406 Return the element context in a form similar to XPath's short form:
3407 '"/root/tag1/../tag"'
3408
3409 xpath
3410 Return a unique XPath expression that can be used to find the
3411 element again.
3412
3413 It looks like "/doc/sect[3]/title": unique elements do not have an
3414 index, the others do.
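
To make the difference between "path" and "xpath" concrete, here is a sketch on a made-up document with two "sect" elements. It also feeds the result of "xpath" back to "get_xpath" to find the element again:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse(   # hypothetical document
    '<doc><sect><title>One</title></sect><sect><title>Two</title></sect></doc>'
);

# the title of the second section
my $title = ( $twig->root->children('sect') )[1]->first_child('title');

my $path  = $title->path;    # context only, no indexes
my $xpath = $title->xpath;   # indexed where siblings share a tag

# the xpath expression should lead back to the same element
my ($found) = $twig->get_xpath($xpath);

print "path: $path\nxpath: $xpath\n";
```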
3415
3416 flush
3417 flushes the twig up to the current element (strictly equivalent to
3418 "$elt->root->flush")
3419
3420 private methods
3421 Low-level methods on the twig:
3422
3423 set_parent ($parent)
3424 set_first_child ($first_child)
3425 set_last_child ($last_child)
3426 set_prev_sibling ($prev_sibling)
3427 set_next_sibling ($next_sibling)
3428 set_twig_current
3429 del_twig_current
3430 twig_current
3431 contains_text
3432
3433 Those methods should not be used, unless of course you find some
3434 creative and interesting, not to mention useful, ways to do it.
3435
3436 cond
Most of the navigation functions accept a condition as an optional
argument. The first element (or all elements for "children" or
"ancestors") that passes the condition is returned.

The condition is a single step of an XPath expression using the XPath
subset defined by "get_xpath". Additional conditions are:
3445
3446 #ELT
3447 return a "real" element (not a PCDATA, CDATA, comment or pi
3448 element)
3449
3450 #TEXT
3451 return a PCDATA or CDATA element
3452
3453 regular expression
3454 return an element whose tag matches the regexp. The regexp has to
3455 be created with "qr//" (hence this is available only on perl 5.005
3456 and above)
3457
3458 code reference
3459 applies the code, passing the current element as argument, if the
3460 code returns true then the element is returned, if it returns false
3461 then the code is applied to the next candidate.
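
The kinds of conditions listed above can be sketched like this. The document and its tag names are assumptions made for the example:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc>intro<p>para</p><h1>one</h1><h2>two</h2></doc>');  # hypothetical
my $root = $twig->root;

my $text     = $root->first_child('#TEXT');      # the leading text
my $elt      = $root->first_child('#ELT');       # first "real" element
my @headings = $root->children( qr/^h\d$/ );     # tags matching a regexp

# code reference: receives the candidate element as its argument
my $h2 = $root->first_child( sub { $_[0]->tag eq 'h2' } );

print $text->text, ' ', $elt->tag, ' ', scalar(@headings), ' ', $h2->text, "\n";
```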
3462
3463 XML::Twig::XPath
3464 XML::Twig implements a subset of XPath through the "get_xpath" method.
3465
3466 If you want to use the whole XPath power, then you can use
3467 "XML::Twig::XPath" instead. In this case "XML::Twig" uses "XML::XPath"
3468 to execute XPath queries. You will of course need "XML::XPath"
3469 installed to be able to use "XML::Twig::XPath".
3470
3471 See XML::XPath for more information.
3472
3473 The methods you can use are:
3474
3475 findnodes ($path)
3476 return a list of nodes found by $path.
3477
3478 findnodes_as_string ($path)
3479 return the nodes found reproduced as XML. The result is not
3480 guaranteed to be valid XML though.
3481
3482 findvalue ($path)
3483 return the concatenation of the text content of the result nodes
3484
3485 In order for "XML::XPath" to be used as the XPath engine the following
3486 methods are included in "XML::Twig":
3487
3488 in XML::Twig
3489
3490 getRootNode
3491 getParentNode
3492 getChildNodes
3493
3494 in XML::Twig::Elt
3495
3496 string_value
3497 toString
3498 getName
3499 getRootNode
3500 getNextSibling
3501 getPreviousSibling
3502 isElementNode
3503 isTextNode
3504 isPI
3505 isPINode
3506 isProcessingInstructionNode
3507 isComment
3508 isCommentNode
3509 getTarget
3510 getChildNodes
3511 getElementById
3512
3513 XML::Twig::XPath::Elt
3514 The methods you can use are the same as on "XML::Twig::XPath" elements:
3515
3516 findnodes ($path)
3517 return a list of nodes found by $path.
3518
3519 findnodes_as_string ($path)
3520 return the nodes found reproduced as XML. The result is not
3521 guaranteed to be valid XML though.
3522
3523 findvalue ($path)
3524 return the concatenation of the text content of the result nodes
3525
3526 XML::Twig::Entity_list
3527 new Create an entity list.
3528
3529 add ($ent)
3530 Add an entity to an entity list.
3531
3532 add_new_ent ($name, $val, $sysid, $pubid, $ndata, $param)
3533 Create a new entity and add it to the entity list
3534
3535 delete ($ent or $tag).
3536 Delete an entity (defined by its name or by the Entity object) from
3537 the list.
3538
3539 print ($optional_filehandle)
3540 Print the entity list.
3541
3542 list
3543 Return the list as an array
3544
3545 XML::Twig::Entity
3546 new ($name, $val, $sysid, $pubid, $ndata, $param)
3547 Same arguments as the Entity handler for XML::Parser.
3548
3549 print ($optional_filehandle)
3550 Print an entity declaration.
3551
3552 name
3553 Return the name of the entity
3554
3555 val Return the value of the entity
3556
3557 sysid
3558 Return the system id for the entity (for NDATA entities)
3559
3560 pubid
3561 Return the public id for the entity (for NDATA entities)
3562
3563 ndata
3564 Return true if the entity is an NDATA entity
3565
3566 param
3567 Return true if the entity is a parameter entity
3568
3569 text
3570 Return the entity declaration text.
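
A minimal sketch of building an entity list by hand, using only the methods listed above. The entity name and value are made up, and the arguments not needed for a plain internal entity are simply left out:

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical entity: name and value only
my $ent = XML::Twig::Entity->new( 'corp', 'Example Corp' );

my $list = XML::Twig::Entity_list->new;
$list->add($ent);

# list returns the entities as a Perl list
my @ents = $list->list;
print $_->name, ' => ', $_->val, "\n" for @ents;
```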
3571
3572 XML::Twig::Notation_list
new Create a notation list.
3574
3575 add ($notation)
Add a notation to a notation list.
3577
3578 add_new_notation ($name, $base, $sysid, $pubid)
3579 Create a new notation and add it to the notation list
3580
3581 delete ($notation or $tag).
Delete a notation (defined by its name or by the Notation object)
from the list.
3584
3585 print ($optional_filehandle)
3586 Print the notation list.
3587
3588 list
3589 Return the list as an array
3590
3591 XML::Twig::Notation
3592 new ($name, $base, $sysid, $pubid)
Same arguments as the Notation handler for XML::Parser.
3594
3595 print ($optional_filehandle)
Print a notation declaration.
3597
3598 name
3599 Return the name of the notation
3600
3601 base
3602 Return the base to be used for resolving a relative URI
3603
3604 sysid
3605 Return the system id for the notation
3606
3607 pubid
3608 Return the public id for the notation
3609
3610 text
3611 Return the notation declaration text.
3612
Additional examples (and a complete tutorial) can be found on the
XML::Twig page: <http://www.xmltwig.org/xmltwig/>
3616
To figure out what flush does, call the following script with an XML
file and an element name as arguments:
3619
3620 use XML::Twig;
3621
3622 my ($file, $elt)= @ARGV;
3623 my $t= XML::Twig->new( twig_handlers =>
3624 { $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
3625 $t->parsefile( $file, ErrorContext => 2);
3626 $t->flush;
3627 print "\n";
3628
3630 Subclassing XML::Twig
3631 Useful methods:
3632
3633 elt_class
In order to subclass "XML::Twig" you will probably also need to
subclass "XML::Twig::Elt". Use the "elt_class" option when you create
the "XML::Twig" object to get the elements created in a different
class (which should be a subclass of "XML::Twig::Elt").
3638
3639 add_options
If you inherit the "XML::Twig" new method but want to add more
options to it, you can use this method to prevent XML::Twig from
issuing warnings for those additional options.
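
A short sketch of the "elt_class" option. The "My::Elt" class and its "shout" method are hypothetical names invented for this example:

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical element subclass adding one convenience method
package My::Elt;
our @ISA = ('XML::Twig::Elt');

sub shout { my $self = shift; return uc $self->text; }

package main;

# elements are now created in My::Elt instead of XML::Twig::Elt
my $twig = XML::Twig->new( elt_class => 'My::Elt' );
$twig->parse('<doc>hello</doc>');

my $root    = $twig->root;
my $shouted = $root->shout;
print ref($root), ": $shouted\n";
```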
3643
3644 DTD Handling
3645 There are 3 possibilities here. They are:
3646
3647 No DTD
3648 No doctype, no DTD information, no entity information, the world is
3649 simple...
3650
3651 Internal DTD
3652 The XML document includes an internal DTD, and maybe entity
3653 declarations.
3654
3655 If you use the load_DTD option when creating the twig the DTD
3656 information and the entity declarations can be accessed.
3657
The DTD and the entity declarations will be "flush"'ed (or
"print"'ed) either as is (if they have not been modified) or as
reconstructed (poorly, comments are lost, order is not kept, due to
its content this DTD should not be viewed by anyone) if they have
been modified. You can also modify them directly by changing the
"$twig->{twig_doctype}->{internal}" field (straight from
XML::Parser, see the "Doctype" handler doc).
3665
3666 External DTD
3667 The XML document includes a reference to an external DTD, and maybe
3668 entity declarations.
3669
If you use the "load_DTD" option when creating the twig the DTD
3671 information and the entity declarations can be accessed. The entity
3672 declarations will be "flush"'ed (or "print"'ed) either as is (if
3673 they have not been modified) or as reconstructed (badly, comments
3674 are lost, order is not kept).
3675
You can change the doctype through the "$twig->set_doctype" method
and print the dtd through the "$twig->dtd_text" or
"$twig->dtd_print" methods.
3680
3681 If you need to modify the entity list this is probably the easiest
3682 way to do it.
3683
3684 Flush
3685 Remember that element handlers are called when the element is CLOSED,
3686 so if you have handlers for nested elements the inner handlers will be
called first. This makes it trickier than it would seem, for example,
to number nested sections (or clauses, or divs), as the titles in the
inner sections are handled before the outer sections.
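
The calling order is easy to check with a sketch like the following, where the nested "section" elements and their "id" attributes are made up for the illustration:

```perl
use strict;
use warnings;
use XML::Twig;

my @closed;   # ids, in the order the handler is called

my $twig = XML::Twig->new(
    twig_handlers => {
        # called when each section is CLOSED; $_ is the current element
        section => sub { push @closed, $_->att('id') },
    },
);

# hypothetical document with one section nested in another
$twig->parse('<doc><section id="outer"><section id="inner"/></section></doc>');

print join( ', ', @closed ), "\n";
```

The inner section closes first, so its handler runs first, even though the outer section starts first in the document.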
3690
3692 segfault during parsing
3693 This happens when parsing huge documents, or lots of small ones,
3694 with a version of Perl before 5.16.
3695
3696 This is due to a bug in the way weak references are handled in Perl
3697 itself.
3698
3699 The fix is either to upgrade to Perl 5.16 or later ("perlbrew" is a
3700 great tool to manage several installations of perl on the same
3701 machine).
3702
Another, NOT RECOMMENDED, way of fixing the problem is to switch
off weak references by writing "XML::Twig::_set_weakrefs( 0);" at
the top of the code. This is totally unsupported, and may lead to
other problems though.
3707
3708 entity handling
3709 Due to XML::Parser behaviour, non-base entities in attribute values
3710 disappear if they are not declared in the document:
3711 "att="val&ent;"" will be turned into "att => val", unless you use
3712 the "keep_encoding" argument to "XML::Twig->new"
3713
3714 DTD handling
The DTD handling methods are quite buggy. No one uses them and it
3716 seems very difficult to get them to work in all cases, including
3717 with several slightly incompatible versions of XML::Parser and of
3718 libexpat.
3719
3720 Basically you can read the DTD, output it back properly, and update
3721 entities, but not much more.
3722
3723 So use XML::Twig with standalone documents, or with documents
3724 referring to an external DTD, but don't expect it to properly parse
3725 and even output back the DTD.
3726
3727 memory leak
If you use a REALLY old Perl (5.005!) and a lot of twigs you might
find that you leak quite a lot of memory (about 2KB per twig). You
can use the "dispose" method to free that memory after you are
done.
3732
3733 If you create elements the same thing might happen, use the
3734 "delete" method to get rid of them.
3735
3736 Alternatively installing the "Scalar::Util" (or "WeakRef") module
3737 on a version of Perl that supports it (>5.6.0) will get rid of the
3738 memory leaks automagically.
3739
3740 ID list
3741 The ID list is NOT updated when elements are cut or deleted.
3742
3743 change_gi
3744 This method will not function properly if you do:
3745
3746 $twig->change_gi( $old1, $new);
3747 $twig->change_gi( $old2, $new);
3748 $twig->change_gi( $new, $even_newer);
3749
3750 sanity check on XML::Parser method calls
3751 XML::Twig should really prevent calls to some XML::Parser methods,
3752 especially the "setHandlers" method.
3753
3754 pretty printing
3755 Pretty printing (at least using the '"indented"' style) is hard to
3756 get right! Only elements that belong to the document will be
3757 properly indented. Printing elements that do not belong to the twig
3758 makes it impossible for XML::Twig to figure out their depth, and
3759 thus their indentation level.
3760
3761 Also there is an unavoidable bug when using "flush" and pretty
3762 printing for elements with mixed content that start with an
3763 embedded element:
3764
3765 <elt><b>b</b>toto<b>bold</b></elt>
3766
3767 will be output as
3768
3769 <elt>
3770 <b>b</b>toto<b>bold</b></elt>
3771
3772 if you flush the twig when you find the "<b>" element
3773
3775 These are the things that can mess up calling code, especially if
threaded. They might also cause problems under mod_perl.
3777
3778 Exported constants
Whether you want them or not you get them! These are subroutines to
use as constants when creating or testing elements
3781
3782 PCDATA return '#PCDATA'
3783 CDATA return '#CDATA'
3784 PI return '#PI', I had the choice between PROC and PI :--(
3785
3786 Module scoped values: constants
3787 these should cause no trouble:
3788
%base_ent= ( '>' => '&gt;',
             '<' => '&lt;',
             '&' => '&amp;',
             "'" => '&apos;',
             '"' => '&quot;',
           );
3795 CDATA_START = "<![CDATA[";
3796 CDATA_END = "]]>";
3797 PI_START = "<?";
3798 PI_END = "?>";
3799 COMMENT_START = "<!--";
3800 COMMENT_END = "-->";
3801
3802 pretty print styles
3803
3804 ( $NSGMLS, $NICE, $INDENTED, $INDENTED_C, $WRAPPED, $RECORD1, $RECORD2)= (1..7);
3805
3806 empty tag output style
3807
3808 ( $HTML, $EXPAND)= (1..2);
3809
3810 Module scoped values: might be changed
3811 Most of these deal with pretty printing, so the worst that can
3812 happen is probably that XML output does not look right, but is
3813 still valid and processed identically by XML processors.
3814
$empty_tag_style can mess up HTML browsers though and changing $ID
3816 would most likely create problems.
3817
3818 $pretty=0; # pretty print style
3819 $quote='"'; # quote for attributes
3820 $INDENT= ' '; # indent for indented pretty print
3821 $empty_tag_style= 0; # how to display empty tags
3822 $ID # attribute used as an id ('id' by default)
3823
3824 Module scoped values: definitely changed
3825 These 2 variables are used to replace tags by an index, thus saving
3826 some space when creating a twig. If they really cause you too much
3827 trouble, let me know, it is probably possible to create either a
3828 switch or at least a version of XML::Twig that does not perform
3829 this optimization.
3830
3831 %gi2index; # tag => index
3832 @index2gi; # list of tags
3833
3834 If you need to manipulate all those values, you can use the following
3835 methods on the XML::Twig object:
3836
3837 global_state
3838 Return a hashref with all the global variables used by XML::Twig
3839
3840 The hash has the following fields: "pretty", "quote", "indent",
3841 "empty_tag_style", "keep_encoding", "expand_external_entities",
3842 "output_filter", "output_text_filter", "keep_atts_order"
3843
3844 set_global_state ($state)
3845 Set the global state, $state is a hashref
3846
3847 save_global_state
3848 Save the current global state
3849
3850 restore_global_state
Restore the previously saved (using "save_global_state") state
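
One way to use these methods is to bracket a temporary change of output settings. This is a sketch under the assumption that only the pretty print style is changed, on a made-up document:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse('<doc><p>text</p></doc>');   # hypothetical document

my $pretty_before = $twig->global_state->{pretty};

$twig->save_global_state;                 # remember the current settings
$twig->set_pretty_print('indented');      # temporary, module-wide change
my $indented = $twig->sprint;
$twig->restore_global_state;              # back to what was saved

my $pretty_after = $twig->global_state->{pretty};
print $indented;
```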
3852
3854 SAX handlers
3855 Allowing XML::Twig to work on top of any SAX parser
3856
3857 multiple twigs are not well supported
3858 A number of twig features are just global at the moment. These
3859 include the ID list and the "tag pool" (if you use "change_gi" then
3860 you change the tag for ALL twigs).
3861
A future version will try to support this while trying not to be too
hard on performance (at least when a single twig is used!).
3864
3866 Michel Rodriguez <mirod@cpan.org>
3867
3869 This library is free software; you can redistribute it and/or modify it
3870 under the same terms as Perl itself.
3871
3872 Bug reports should be sent using: RT
3873 <http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Twig>
3874
3875 Comments can be sent to mirod@cpan.org
3876
The XML::Twig page is at <http://www.xmltwig.org/xmltwig/>. It includes
the development version of the module, a slightly better version of the
documentation, examples, and a tutorial, "Processing XML efficiently
with Perl and XML::Twig":
<http://www.xmltwig.org/xmltwig/tutorial/index.html>
3882
3884 Complete docs, including a tutorial, examples, an easier to use HTML
3885 version of the docs, a quick reference card and a FAQ are available at
3886 <http://www.xmltwig.org/xmltwig/>
3887
3888 git repository at <http://github.com/mirod/xmltwig>
3889
3890 XML::Parser, XML::Parser::Expat, XML::XPath, Encode, Text::Iconv,
Scalar::Util
3892
3893 Alternative Modules
XML::Twig is not the only XML processing module available on CPAN (far
3895 from it!).
3896
3897 The main alternative I would recommend is XML::LibXML.
3898
3899 Here is a quick comparison of the 2 modules:
3900
3901 XML::LibXML, actually "libxml2" on which it is based, sticks to the
3902 standards, and implements a good number of them in a rather strict way:
3903 XML, XPath, DOM, RelaxNG, I must be forgetting a couple (XInclude?). It
3904 is fast and rather frugal memory-wise.
3905
3906 XML::Twig is older: when I started writing it XML::Parser/expat was the
3907 only game in town. It implements XML and that's about it (plus a subset
3908 of XPath, and you can use XML::Twig::XPath if you have XML::XPathEngine
3909 installed for full support). It is slower and requires more memory for
3910 a full tree than XML::LibXML. On the plus side (yes, there is a plus
side!) it lets you process a big document in chunks, and thus lets you
3912 tackle documents that couldn't be loaded in memory by XML::LibXML, and
3913 it offers a lot (and I mean a LOT!) of higher-level methods, for
3914 everything, from adding structure to "low-level" XML, to shortcuts for
3915 XHTML conversions and more. It also DWIMs quite a bit, getting comments
3916 and non-significant whitespaces out of the way but preserving them in
the output for example. As it does not stick to the DOM, it also
usually leads to shorter code than in XML::LibXML.
3919
3920 Beyond the pure features of the 2 modules, XML::LibXML seems to be
3921 preferred by "XML-purists", while XML::Twig seems to be more used by
3922 Perl Hackers who have to deal with XML. As you have noted, XML::Twig
3923 also comes with quite a lot of docs, but I am sure if you ask for help
3924 about XML::LibXML here or on Perlmonks you will get answers.
3925
3926 Note that it is actually quite hard for me to compare the 2 modules: on
3927 one hand I know XML::Twig inside-out and I can get it to do pretty much
3928 anything I need to (or I improve it ;--), while I have a very basic
3929 knowledge of XML::LibXML. So feature-wise, I'd rather use XML::Twig
3930 ;--). On the other hand, I am painfully aware of some of the
3931 deficiencies, potential bugs and plain ugly code that lurk in
3932 XML::Twig, even though you are unlikely to be affected by them (unless
3933 for example you need to change the DTD of a document programmatically),
while I haven't looked much into XML::LibXML so it still looks shiny
3935 and clean to me.
3936
That said, if you need to process a document that is too big to fit in
3938 memory and XML::Twig is too slow for you, my reluctant advice would be
3939 to use "bare" XML::Parser. It won't be as easy to use as XML::Twig:
3940 basically with XML::Twig you trade some speed (depending on what you do
3941 from a factor 3 to... none) for ease-of-use, but it will be easier IMHO
3942 than using SAX (albeit not standard), and at this point a LOT faster
3943 (see the last test in
3944 <http://www.xmltwig.org/article/simple_benchmark/>).
3945
3946
3947
3948perl v5.34.0 2021-07-23 Twig(3)