Twig(3)               User Contributed Perl Documentation              Twig(3)

NAME
XML::Twig - A perl module for processing huge XML documents in tree
mode.
8
SYNOPSIS
10 Note that this documentation is intended as a reference to the module.
11
12 Complete docs, including a tutorial, examples, an easier to use HTML
13 version, a quick reference card and a FAQ are available at
14 <http://www.xmltwig.org/xmltwig>
15
16 Small documents (loaded in memory as a tree):
17
18 my $twig=XML::Twig->new(); # create the twig
19 $twig->parsefile( 'doc.xml'); # build it
20 my_process( $twig); # use twig methods to process it
21 $twig->print; # output the twig
22
23 Huge documents (processed in combined stream/tree mode):
24
25 # at most one div will be loaded in memory
26 my $twig=XML::Twig->new(
27 twig_handlers =>
28 { title => sub { $_->set_tag( 'h2') }, # change title tags to h2
29 para => sub { $_->set_tag( 'p') }, # change para to p
30 hidden => sub { $_->delete; }, # remove hidden elements
31 list => \&my_list_process, # process list elements
32 div => sub { $_[0]->flush; }, # output and free memory
33 },
34 pretty_print => 'indented', # output will be nicely formatted
35 empty_tags => 'html', # outputs <empty_tag />
36 );
37 $twig->parsefile( 'my_big.xml');
38
39 See XML::Twig 101 for other ways to use the module, as a filter for
40 example.
41
DESCRIPTION
This module provides a way to process XML documents. It is built on top
of "XML::Parser".
45
46 The module offers a tree interface to the document, while allowing you
47 to output the parts of it that have been completely processed.
48
It allows minimal resource (CPU and memory) usage by building the tree
only for the parts of the documents that need actual processing,
through the use of the "twig_roots" and "twig_print_outside_roots"
options. The "finish" and "finish_print" methods also help to
improve performance.

XML::Twig tries to make simple things easy, so it does its best to
take care of a lot of the (usually) annoying (but sometimes necessary)
features that come with XML and XML::Parser.
58
XML::Twig 101
60 XML::Twig can be used either on "small" XML documents (that fit in
61 memory) or on huge ones, by processing parts of the document and
62 outputting or discarding them once they are processed.
63
64 Loading an XML document and processing it
    my $t= XML::Twig->new();
    $t->parse( '<d><title>title</title><para>p 1</para><para>p 2</para></d>');
    my $root= $t->root;
    $root->set_tag( 'html');                  # change doc to html
    my $title= $root->first_child( 'title');  # get the title
    $title->set_tag( 'h1');                   # turn it into h1
    my @para= $root->children( 'para');       # get the para children
    foreach my $para (@para)
      { $para->set_tag( 'p'); }               # turn them into p
    $t->print;                                # output the document
75
76 Other useful methods include:
77
att: "$elt->{'att'}->{'foo'}" returns the "foo" attribute for an
element,

set_att: "$elt->set_att( foo => "bar")" sets the "foo" attribute to
the "bar" value,

next_sibling: "$elt->{next_sibling}" returns the next sibling in the
document (in the example "$title->{next_sibling}" is the first "para");
you can also (and actually should) use "$elt->next_sibling( 'para')" to
get it.

The document can also be transformed through the use of the cut, copy,
paste and move methods: "$title->cut; $title->paste( after => $p);" for
example.
92
93 And much, much more, see XML::Twig::Elt.
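
The methods listed above can be tried on a small in-memory document. The following is a minimal sketch rather than an excerpt from the module: the sample XML, the added "class" attribute and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new;
$t->parse('<d><title nb="1">title</title><para>p 1</para><para>p 2</para></d>');

my $root  = $t->root;
my $title = $root->first_child('title');

my $nb = $title->att('nb');                     # read an attribute
$title->set_att( class => 'main' );             # add a (made-up) attribute

my $first_para = $title->next_sibling('para');  # first "para" after the title
my $last_para  = $root->last_child('para');

# move the title after the last para
$title->cut;
$title->paste( after => $last_para );

my @tags = map { $_->tag } $root->children;     # tags of the children, in order
print "@tags\n";
```

Using the "att" and "next_sibling" methods, rather than direct hash access, keeps the code independent of how elements are stored internally.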
94
95 Processing an XML document chunk by chunk
One of the strengths of XML::Twig is that it lets you work with files
that do not fit in memory (BTW storing an XML document in memory as a
tree is quite memory-expensive, the expansion factor being often around
10).

To do this you can define handlers that will be called once a specific
element has been completely parsed. In these handlers you can access
the element and process it as you see fit, using the navigation and the
cut-n-paste methods, plus lots of convenient ones like "prefix". Once
the element is completely processed you can then "flush" it, which
will output it and free the memory. You can also "purge" it if you
don't need to output it (if you are just extracting some data from the
document for example). The handler will be called again once the next
relevant element has been parsed.
110
    my $t= XML::Twig->new( twig_handlers =>
                             { section => \&section,
                               para    => sub { $_->set_tag( 'p'); }
                             },
                         );
    $t->parsefile( 'doc.xml');

    # the handler is called once a section is completely parsed, ie when
    # the end tag for section is found, it receives the twig itself and
    # the element (including all its sub-elements) as arguments
    sub section
      { my( $t, $section)= @_;       # arguments for all twig_handlers
        $section->set_tag( 'div');   # change the tag name, my favourite method...
        # let's use the attribute nb as a prefix to the title
        my $title= $section->first_child( 'title'); # find the title
        my $nb= $title->{'att'}->{'nb'};            # get the attribute
        $title->prefix( "$nb - ");                  # easy isn't it?
        $section->flush;             # outputs the section and frees memory
      }
130
131 There is of course more to it: you can trigger handlers on more
132 elaborate conditions than just the name of the element, "section/title"
133 for example.
134
135 my $t= XML::Twig->new( twig_handlers =>
136 { 'section/title' => sub { $_->print } }
137 )
138 ->parsefile( 'doc.xml');
139
140 Here "sub { $_->print }" simply prints the current element ($_ is
141 aliased to the element in the handler).
142
143 You can also trigger a handler on a test on an attribute:
144
    my $t= XML::Twig->new( twig_handlers =>
                             { 'section[@level="1"]' => sub { $_->print } }
                         )
                    ->parsefile( 'doc.xml');
149
You can also use "start_tag_handlers" to process an element as soon as
the start tag is found. Besides "prefix" you can also use "suffix".
152
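The path and attribute triggers described in this section can be exercised without any file on disk. This is a minimal sketch under stated assumptions: the sample document, the "id" attributes and the two collecting arrays are invented for the example, and the handlers collect values instead of printing them.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<doc>'
        . '<section level="1" id="s1"><title>Intro</title></section>'
        . '<section level="2" id="s2"><title>Details</title></section>'
        . '<annex><title>Extra</title></annex>'
        . '</doc>';

my( @titles, @level1);
my $t = XML::Twig->new(
    twig_handlers => {
        # only titles that are children of a section
        'section/title'       => sub { push @titles, $_->text;      1 },
        # only sections whose level attribute is "1"
        'section[@level="1"]' => sub { push @level1, $_->att('id'); 1 },
    },
);
$t->parse( $xml);

print "titles: @titles\n";
print "level 1 sections: @level1\n";
```

Each handler ends with an explicit 1 so that its return value does not depend on the last statement executed.
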
153 Processing just parts of an XML document
The twig_roots mode builds only the required sub-trees from the
document. Anything outside of the twig roots will just be ignored:
156
157 my $t= XML::Twig->new(
158 # the twig will include just the root and selected titles
159 twig_roots => { 'section/title' => \&print_n_purge,
160 'annex/title' => \&print_n_purge
161 }
162 );
163 $t->parsefile( 'doc.xml');
164
165 sub print_n_purge
166 { my( $t, $elt)= @_;
167 print $elt->text; # print the text (including sub-element texts)
168 $t->purge; # frees the memory
169 }
170
You can use that mode when you want to process parts of a document but
are not interested in the rest and you don't want to pay the price,
either in time or memory, to build the tree for it.
174
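A self-contained variant of the twig_roots mode, parsing a string and collecting the title texts instead of printing them. It is only a sketch: the sample document and the @found array are assumptions, not part of the module.

```perl
use strict;
use warnings;
use XML::Twig;

my @found;
my $t = XML::Twig->new(
    twig_roots => {
        'section/title' => sub {
            my( $twig, $elt) = @_;
            push @found, $elt->text;   # keep the text of the title
            $twig->purge;              # free the memory used so far
            return 1;
        },
    },
);
$t->parse( '<doc><section><title>One</title><para>not built</para></section>'
         . '<section><title>Two</title></section></doc>');

# para elements are outside the twig roots, so they are never built
my @paras = $t->root->descendants('para');
print "found: @found, para elements in the twig: ", scalar @paras, "\n";
```
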
175 Building an XML filter
176 You can combine the "twig_roots" and the "twig_print_outside_roots"
177 options to build filters, which let you modify selected elements and
178 will output the rest of the document as is.
179
180 This would convert prices in $ to prices in Euro in a document:
181
182 my $t= XML::Twig->new(
183 twig_roots => { 'price' => \&convert, }, # process prices
184 twig_print_outside_roots => 1, # print the rest
185 );
186 $t->parsefile( 'doc.xml');
187
    sub convert
      { my( $t, $price)= @_;
        my $currency= $price->{'att'}->{'currency'};   # get the currency
        if( $currency eq 'USD')
          { my $usd_price= $price->text;               # get the price
            # %rate is just a conversion table
            my $euro_price= $usd_price * $rate{usd2euro};
            $price->set_text( $euro_price);            # set the new price
            $price->set_att( currency => 'EUR');       # don't forget this!
          }
        $price->print;                                 # output the price
      }
200
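The filter above relies on a %rate table and a doc.xml file that are not shown. The sketch below fills these in so the idea can be run as is. The rate of 0.9, the sample document and the in-memory output handle are all assumptions chosen for the example, and the output is sent to a file handle passed as the value of "twig_print_outside_roots" so that it can be inspected afterwards.

```perl
use strict;
use warnings;
use XML::Twig;

my %rate = ( usd2euro => 0.9 );   # assumed rate, for illustration only

my $buf = '';
open( my $out, '>', \$buf) or die "cannot open in-memory handle: $!";

my $t = XML::Twig->new(
    twig_roots               => { price => \&convert },  # process prices
    twig_print_outside_roots => $out,                    # copy the rest to $out
);
$t->parse( '<doc><item><name>Widget</name>'
         . '<price currency="USD">10</price></item></doc>');
close $out;
print $buf, "\n";

sub convert
  { my( $twig, $price) = @_;
    if( ( $price->att('currency') || '') eq 'USD')
      { $price->set_text( sprintf '%.2f', $price->text * $rate{usd2euro});
        $price->set_att( currency => 'EUR');
      }
    $price->print( $out);          # the price goes to the same handle
  }
```
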
201 XML::Twig and various versions of Perl, XML::Parser and expat:
202 XML::Twig is a lot more sensitive to variations in versions of perl,
203 XML::Parser and expat than to the OS, so this should cover some
204 reasonable configurations.
205
206 The "recommended configuration" is perl 5.8.3+ (for good Unicode
207 support), XML::Parser 2.31+ and expat 1.95.5+
208
209 See <http://testers.cpan.org/search?request=dist&dist=XML-Twig> for the
210 CPAN testers reports on XML::Twig, which list all tested
211 configurations.
212
213 An Atom feed of the CPAN Testers results is available at
214 <http://xmltwig.org/rss/twig_testers.rss>
215
216 Finally:
217
218 XML::Twig does NOT work with expat 1.95.4
219 XML::Twig only works with XML::Parser 2.27 in perl 5.6.*
220 Note that I can't compile XML::Parser 2.27 anymore, so I can't
221 guarantee that it still works
222
223 XML::Parser 2.28 does not really work
224
225 When in doubt, upgrade expat, XML::Parser and Scalar::Util
226
227 Finally, for some optional features, XML::Twig depends on some
228 additional modules. The complete list, which depends somewhat on the
229 version of Perl that you are running, is given by running
230 "t/zz_dump_config.t"
231
Simplifying XML processing
233 Whitespaces
Whitespaces that look non-significant are discarded. This behaviour
can be controlled using the "keep_spaces", "keep_spaces_in" and
"discard_spaces_in" options.
237
238 Encoding
You can specify that you want the output in the same encoding as
the input (provided you have valid XML, which means you have to
specify the encoding either in the document or when you create the
Twig object) using the "keep_encoding" option.
243
244 You can also use "output_encoding" to convert the internal UTF-8
245 format to the required encoding.
246
247 Comments and Processing Instructions (PI)
Comments and PI's can be hidden from the processing, but still
appear in the output (they are carried by the "real" element closest
to them).
251
252 Pretty Printing
253 XML::Twig can output the document pretty printed so it is easier to
254 read for us humans.
255
256 Surviving an untimely death
257 XML parsers are supposed to react violently when fed improper XML.
258 XML::Parser just dies.
259
XML::Twig provides the "safe_parse" and the "safe_parsefile"
methods, which wrap the parse in an eval and return either the
parsed twig or 0 in case of failure.
263
264 Private attributes
265 Attributes with a name starting with # (illegal in XML) will not be
266 output, so you can safely use them to store temporary values during
267 processing. Note that you can store anything in a private
268 attribute, not just text, it's just a regular Perl variable, so a
269 reference to an object or a huge data structure is perfectly fine.
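
Two of the features above, "safe_parse" and private attributes, can be sketched together. The documents, the "status" attribute and the "#cache" attribute name are assumptions made for this example only.

```perl
use strict;
use warnings;
use XML::Twig;

# safe_parse returns the twig on success and a false value on failure
my $twig = XML::Twig->new;
my $ok   = $twig->safe_parse('<doc><item>one</item></doc>');
die "unexpected parse failure: $@" unless $ok;

my $item = $twig->root->first_child('item');
$item->set_att( status   => 'done');             # regular attribute, output
$item->set_att( '#cache' => { checked => 1 });   # private attribute, not output
my $out = $twig->sprint;
print "$out\n";

# malformed XML: the parse fails but the program keeps running
my $bad = XML::Twig->new->safe_parse('<doc><item>one</doc>');
print $bad ? "parsed\n" : "parse failed\n";
```

The private attribute holds a hash reference here, which is fine since it is never serialized.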
270
CLASSES
XML::Twig uses a very limited number of classes. The ones you are most
likely to use are "XML::Twig" of course, which represents a complete
XML document, including the document itself (the root of the document
itself is "root"), its handlers, its input or output filters... The
other main class is "XML::Twig::Elt", which models an XML element.
Element here has a very wide definition: it can be a regular element,
but also text, with an element "tag" of "#PCDATA" (or "#CDATA"), an
entity (tag is "#ENT"), a Processing Instruction ("#PI") or a comment
("#COMMENT").
281
282 Those are the 2 commonly used classes.
283
You might want to look at the "elt_class" option if you want to subclass
"XML::Twig::Elt".

Attributes are just attached to their parent element, they are not
objects per se. (Please use the provided methods "att" and "set_att" to
access them; if you access them as a hash, then your code becomes
implementation dependent and might break in the future.)
291
292 Other classes that are seldom used are "XML::Twig::Entity_list" and
293 "XML::Twig::Entity".
294
295 If you use "XML::Twig::XPath" instead of "XML::Twig", elements are then
296 created as "XML::Twig::XPath::Elt"
297
METHODS
299 XML::Twig
A twig is a subclass of XML::Parser, so all XML::Parser methods can be
called on a twig object, including parse and parsefile. "setHandlers"
on the other hand cannot be used, see "BUGS".
303
304 new This is a class method, the constructor for XML::Twig. Options are
305 passed as keyword value pairs. Recognized options are the same as
306 XML::Parser, plus some (in fact a lot!) XML::Twig specifics.
307
308 New Options:
309
310 twig_handlers
This argument consists of a hash "{ expression => \&handler }"
where expression is an XPath-like expression (+ some others).

XPath expressions are limited to using the child and descendant
axes (indeed you can't specify an axis), and predicates cannot
be nested. You can use the "string", or "string(<tag>)"
function (except in "twig_roots" triggers).
318
319 Additionally you can use regexps (/ delimited) to match
320 attribute and string values.
321
322 Examples:
323
324 foo
325 foo/bar
326 foo//bar
327 /foo/bar
328 /foo//bar
329 /foo/bar[@att1 = "val1" and @att2 = "val2"]/baz[@a >= 1]
330 foo[string()=~ /^duh!+/]
331 /foo[string(bar)=~ /\d+/]/baz[@att != 3]
332
333 #CDATA can be used to call a handler for a CDATA section.
334 #COMMENT can be used to call a handler for comments
335
336 Some additional (non-XPath) expressions are also provided for
337 convenience:
338
339 processing instructions
'?' or '#PI' triggers the handler for any processing
instruction, and '?<target>' or '#PI <target>' triggers a
handler for processing instructions with the given target
(ex: '#PI xml-stylesheet').
344
345 level(<level>)
346 Triggers the handler on any element at that level in the
347 tree (root is level 1)
348
349 _all_
350 Triggers the handler for all elements in the tree
351
352 _default_
353 Triggers the handler for each element that does NOT have
354 any other handler.
355
Expressions are evaluated against the input document, which
means that even if you have changed the tag of an element
(changing the tag of a parent element from a handler for
example) the change will not impact the expression evaluation.
There is an exception to this: "private" attributes (whose names
start with a '#', and which can only be created during the parsing,
as they are not valid XML) are checked against the current
twig.

Handlers are triggered in fixed order, sorted by their type
(xpath expressions first, then regexps, then level), then by
whether they specify a full path (starting at the root element)
or not, then by number of steps in the expression, then
number of predicates, then number of tests in predicates.
Handlers where the last step does not specify a step
("foo/bar/*") are triggered after other XPath handlers. Finally
"_all_" handlers are triggered last.
373
Important: once a handler has been triggered, if it returns 0
then no other handler is called, except an "_all_" handler which
will be called anyway.

If a handler returns a true value and other handlers apply,
then the next applicable handler will be called. Repeat, rinse,
lather... The exception to that rule is when the
"do_not_chain_handlers" option is set, in which case only the
first handler will be called.

Note that it might be a good idea to explicitly return a short
true value (like 1) from handlers: this ensures that other
applicable handlers are called even if the last statement for
the handler happens to evaluate to false. This might also
speed up the code by avoiding copying the result of the last
statement and passing it to the code managing
handlers. It can really pay to have 1 instead of a long string
returned.
392
When the closing tag for an element is parsed the corresponding
handler is called, with 2 arguments: the twig and the "Element".
The twig includes the document tree that has been built so
far, the element is the complete sub-tree for the element. The
fact that the handler is called only when the closing tag for
the element is found means that handlers for inner elements are
called before handlers for outer elements.
400
401 $_ is also set to the element, so it is easy to write inline
402 handlers like
403
404 para => sub { $_->set_tag( 'p'); }
405
406 Text is stored in elements whose tag name is #PCDATA (due to
407 mixed content, text and sub-element in an element there is no
408 way to store the text as just an attribute of the enclosing
409 element).
410
411 Warning: if you have used purge or flush on the twig the
412 element might not be complete, some of its children might have
413 been entirely flushed or purged, and the start tag might even
414 have been printed (by "flush") already, so changing its tag
415 might not give the expected result.
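
The effect of handler return values on chaining can be observed with two triggers that match the same element. This is a hedged sketch: the helper called_handlers, the sample document and the two triggers are assumptions for illustration, and the example deliberately does not depend on which of the two handlers is triggered first.

```perl
use strict;
use warnings;
use XML::Twig;

# returns the list of handlers that were called for the para element
# (hypothetical helper, not part of XML::Twig)
sub called_handlers
  { my( $return_value, %options) = @_;
    my @called;
    XML::Twig->new(
        %options,
        twig_handlers => {
            'section/para' => sub { push @called, 'section/para'; $return_value },
            'para'         => sub { push @called, 'para';         $return_value },
        },
    )->parse('<section><para>text</para></section>');
    return @called;
  }

my @chained = called_handlers( 1);   # true value: the next handler is called too
my @stopped = called_handlers( 0);   # 0: the chain stops after one handler
my @single  = called_handlers( 1, do_not_chain_handlers => 1);

print scalar( @chained), " / ", scalar( @stopped), " / ", scalar( @single), "\n";
```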
416
417 twig_roots
This argument lets you build the tree only for those elements
you are interested in.

Example: my $t= XML::Twig->new( twig_roots => { title => 1, subtitle => 1});
         $t->parsefile( $file);

         my $t2= XML::Twig->new( twig_roots => { 'section/title' => 1});
         $t2->parsefile( $file);

This returns a twig containing a document including only the selected
elements ("title" and "subtitle" in the first case), as children of
the root element.
428
You can use generic_attribute_condition, attribute_condition,
full_path, partial_path, tag, tag_regexp, _default_ and _all_
to trigger the building of the twig. string_condition and
regexp_condition cannot be used, as the content of the element,
and the string, have not yet been parsed when the condition is
checked.

WARNING: paths are checked against the document. Even if the
"twig_roots" option is used they will be checked against the
full document tree, not the virtual tree created by XML::Twig.
439
440 WARNING: twig_roots elements should NOT be nested, that would
441 hopelessly confuse XML::Twig ;--(
442
Note: you can set handlers (twig_handlers) using twig_roots

Example: my $t= XML::Twig->new( twig_roots =>
                                  { title    => sub { $_[1]->print; },
                                    subtitle => \&process_subtitle
                                  }
                              );
         $t->parsefile( $file);
452
453 twig_print_outside_roots
454 To be used in conjunction with the "twig_roots" argument. When
455 set to a true value this will print the document outside of the
456 "twig_roots" elements.
457
Example: my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                                twig_print_outside_roots => 1,
                              );
         $t->parsefile( $file);

         { my $nb;
           sub number_title
             { my( $twig, $title)= @_;
               $nb++;
               $title->prefix( "$nb ");
               $title->print;
             }
         }
470
471 This example prints the document outside of the title element,
472 calls "number_title" for each "title" element, prints it, and
473 then resumes printing the document. The twig is built only for
474 the "title" elements.
475
476 If the value is a reference to a file handle then the document
477 outside the "twig_roots" elements will be output to this file
478 handle:
479
    open( my $out, '>', 'out_file.xml') or die "cannot open out_file.xml: $!";
    my $t= XML::Twig->new( twig_roots => { title => \&number_title },
                           # default output to $out
                           twig_print_outside_roots => $out,
                         );

    { my $nb;
      sub number_title
        { my( $twig, $title)= @_;
          $nb++;
          $title->prefix( "$nb ");
          $title->print( $out);    # you have to print to $out here
        }
    }
494
495 start_tag_handlers
A hash "{ expression => \&handler }". Sets element handlers that
are called when the element is open (at the end of the
XML::Parser "Start" handler). The handlers are called with 2
params: the twig and the element. The element is empty at that
point, its attributes are created though.
501
502 You can use generic_attribute_condition, attribute_condition,
503 full_path, partial_path, tag, tag_regexp, _default_ and _all_
504 to trigger the handler.
505
string_condition and regexp_condition cannot be used, as the
content of the element, and the string, have not yet been
parsed when the condition is checked.

The main uses for those handlers are to change the tag name
(you might have to do it as soon as you find the open tag if
you plan to "flush" the twig at some point in the element), and
to create temporary attributes that will be used when
processing sub-elements with "twig_handlers".
515
516 You should also use it to change tags if you use "flush". If
517 you change the tag in a regular "twig_handler" then the start
518 tag might already have been flushed.
519
520 Note: "start_tag" handlers can be called outside of
521 "twig_roots" if this argument is used, in this case handlers
522 are called with the following arguments: $t (the twig), $tag
523 (the tag of the element) and %att (a hash of the attributes of
524 the element).
525
If the "twig_print_outside_roots" argument is also used, if the
last handler called returns a "true" value, then the start
tag will be output as it appeared in the original document, if
the handler returns a "false" value then the start tag will
not be printed (so you can print a modified string yourself for
example).
532
533 Note that you can use the ignore method in "start_tag_handlers"
534 (and only there).
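
The difference in timing between "start_tag_handlers" and "twig_handlers" can be made visible by recording when each one fires. A minimal sketch, assuming a two-element sample document and an @events array that exist only for this example:

```perl
use strict;
use warnings;
use XML::Twig;

my @events;
my $t = XML::Twig->new(
    # called as soon as the start tag has been parsed
    start_tag_handlers => {
        a => sub { push @events, 'start a'; 1 },
        b => sub { push @events, 'start b'; 1 },
    },
    # called once the element has been completely parsed
    twig_handlers => {
        a => sub { push @events, 'end a'; 1 },
        b => sub { push @events, 'end b'; 1 },
    },
);
$t->parse('<a><b>text</b></a>');

print join( ', ', @events), "\n";
```

Start tag handlers fire from the outside in, while regular handlers fire for inner elements first, which is why tag changes that must reach a flushed start tag belong in "start_tag_handlers".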
535
536 end_tag_handlers
A hash "{ expression => \&handler }". Sets element handlers that
are called when the element is closed (at the end of the
XML::Parser "End" handler). The handlers are called with 2
params: the twig and the tag of the element.
541
542 twig_handlers are called when an element is completely parsed,
543 so why have this redundant option? There is only one use for
544 "end_tag_handlers": when using the "twig_roots" option, to
545 trigger a handler for an element outside the roots. It is for
546 example very useful to number titles in a document using nested
547 sections:
548
549 my @no= (0);
550 my $no;
551 my $t= XML::Twig->new(
552 start_tag_handlers =>
553 { section => sub { $no[$#no]++; $no= join '.', @no; push @no, 0; } },
554 twig_roots =>
555 { title => sub { $_[1]->prefix( $no); $_[1]->print; } },
556 end_tag_handlers => { section => sub { pop @no; } },
557 twig_print_outside_roots => 1
558 );
559 $t->parsefile( $file);
560
561 Using the "end_tag_handlers" argument without "twig_roots" will
562 result in an error.
563
564 do_not_chain_handlers
If this option is set to a true value, then only one handler
will be called for each element, even if several satisfy the
condition.

Note that the "_all_" handler will still be called regardless.
570
571 ignore_elts
572 This option lets you ignore elements when building the twig.
573 This is useful in cases where you cannot use "twig_roots" to
574 ignore elements, for example if the element to ignore is a
575 sibling of elements you are interested in.
576
577 Example:
578
579 my $twig= XML::Twig->new( ignore_elts => { elt => 'discard' });
580 $twig->parsefile( 'doc.xml');
581
582 This will build the complete twig for the document, except that
583 all "elt" elements (and their children) will be left out.
584
585 The keys in the hash are triggers, limited to the same subset
586 as "start_tag_handlers". The values can be "discard", to
587 discard the element, "print", to output the element as-is,
588 "string" to store the text of the ignored element(s), including
589 markup, in a field of the twig: "$t->{twig_buffered_string}" or
590 a reference to a scalar, in which case the text of the ignored
591 element(s), including markup, will be stored in the scalar. Any
592 other value will be treated as "discard".
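
A runnable variant of the "discard" case, parsing a string so that the result can be checked directly. The sample document and the "keep" tag are assumptions for this sketch.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new( ignore_elts => { elt => 'discard' });
$twig->parse('<doc><keep>a</keep><elt>x<sub/></elt><keep>b</keep></doc>');

# the elt element and its children were never added to the twig
my $out  = $twig->sprint;
my @kept = $twig->root->children('keep');
print "$out\n";
```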
593
594 char_handler
595 A reference to a subroutine that will be called every time
596 "PCDATA" is found.
597
598 The subroutine receives the string as argument, and returns the
599 modified string:
600
601 # we want all strings in upper case
602 sub my_char_handler
603 { my( $text)= @_;
604 $text= uc( $text);
605 return $text;
606 }
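
A short sketch of how such a handler could be passed to the constructor and what it does to the text stored in the twig. The sample document is an assumption for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new(
    # upper-case every piece of text before it is stored in the twig
    char_handler => sub { my( $text) = @_; return uc $text; },
);
$t->parse('<doc><p>hello</p><p>world</p></doc>');

my @texts = map { $_->text } $t->root->children('p');
print "@texts\n";
```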
607
608 elt_class
The name of a class used to store elements. This class should
inherit from "XML::Twig::Elt" (and by default it is
"XML::Twig::Elt"). This option is used to subclass the element
class and extend it with new methods.
613
614 This option is needed because during the parsing of the XML,
615 elements are created by "XML::Twig", without any control from
616 the user code.
617
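A possible way to use "elt_class" is sketched here. The class name My::Elt and its shout method are hypothetical, invented for this example, and the subclass is declared in the same file for brevity.

```perl
use strict;
use warnings;
use XML::Twig;

{ package My::Elt;                    # hypothetical subclass
  our @ISA = ('XML::Twig::Elt');
  # extra method available on every element of the twig
  sub shout { my( $elt) = @_; return uc $elt->text; }
}

my $t = XML::Twig->new( elt_class => 'My::Elt');
$t->parse('<doc><p>hello</p></doc>');

my $p = $t->root->first_child('p');
print ref( $p), ": ", $p->shout, "\n";
```
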
618 keep_atts_order
619 Setting this option to a true value causes the attribute hash
620 to be tied to a "Tie::IxHash" object. This means that
621 "Tie::IxHash" needs to be installed for this option to be
622 available. It also means that the hash keeps its order, so you
623 will get the attributes in order. This allows outputting the
624 attributes in the same order as they were in the original
625 document.
626
627 keep_encoding
This is a (slightly?) evil option: if the XML document is not
UTF-8 encoded and you want to keep it that way, then setting
keep_encoding will use the "Expat" original_string method for
characters, thus keeping the original encoding, as well as the
original entities in the strings.
633
634 See the "t/test6.t" test file to see what results you can
635 expect from the various encoding options.
636
637 WARNING: if the original encoding is multi-byte then attribute
638 parsing will be EXTREMELY unsafe under any Perl before 5.6, as
639 it uses regular expressions which do not deal properly with
640 multi-byte characters. You can specify an alternate function to
641 parse the start tags with the "parse_start_tag" option (see
642 below)
643
WARNING: this option is NOT used when parsing with the
non-blocking parser ("parse_start", "parse_more", "parse_done"
methods) which you probably should not use with XML::Twig
anyway as they are totally untested!
648
649 output_encoding
This option generates an output_filter using "Encode",
"Text::Iconv" or "Unicode::Map8" and "Unicode::String", and
sets the encoding in the XML declaration. This is the easiest
way to deal with encodings. If you need more sophisticated
features, look at "output_filter" below.
655
656 output_filter
657 This option is used to convert the character encoding of the
658 output document. It is passed either a string corresponding to
659 a predefined filter or a subroutine reference. The filter will
660 be called every time a document or element is processed by the
661 "print" functions ("print", "sprint", "flush").
662
663 Pre-defined filters:
664
665 latin1
uses either "Encode", "Text::Iconv" or "Unicode::Map8" and
"Unicode::String" or a regexp (which works only with
XML::Parser 2.27), in this order, to convert all characters
to ISO-8859-15 (usually latin1 is a synonym for ISO-8859-1,
but in practice it seems that ISO-8859-15, which includes
the euro sign, is more useful and probably what most people
want).
673
674 html
675 does the same conversion as "latin1", plus encodes entities
676 using "HTML::Entities" (oddly enough you will need to have
677 HTML::Entities installed for it to be available). This
678 should only be used if the tags and attribute names
679 themselves are in US-ASCII, or they will be converted and
680 the output will not be valid XML any more
681
682 safe
converts the output to ASCII (US) only, plus character
entities ("&#nnn;"). This should be used only if the tags
and attribute names themselves are in US-ASCII, or they
will be converted and the output will not be valid XML any
more.
688
689 safe_hex
same as "safe" except that the character entities are in
hexadecimal ("&#xnnn;")
692
693 encode_convert ($encoding)
Return a subref that can be used to convert utf8 strings to
$encoding. Uses "Encode".
696
697 my $conv = XML::Twig::encode_convert( 'latin1');
698 my $t = XML::Twig->new(output_filter => $conv);
699
700 iconv_convert ($encoding)
701 this function is used to create a filter subroutine that
702 will be used to convert the characters to the target
703 encoding using "Text::Iconv" (which needs to be installed,
704 look at the documentation for the module and for the
705 "iconv" library to find out which encodings are available
706 on your system)
707
708 my $conv = XML::Twig::iconv_convert( 'latin1');
709 my $t = XML::Twig->new(output_filter => $conv);
710
711 unicode_convert ($encoding)
this function is used to create a filter subroutine that
will be used to convert the characters to the target
encoding using "Unicode::String" and "Unicode::Map8"
(which need to be installed, look at the documentation for
the modules to find out which encodings are available on
your system)
718
719 my $conv = XML::Twig::unicode_convert( 'latin1');
720 my $t = XML::Twig->new(output_filter => $conv);
721
The "text" and "att" methods do not use the filter, so their
results are always in unicode.
724
725 Those predeclared filters are based on subroutines that can be
726 used by themselves (as "XML::Twig::foo").
727
728 html_encode ($string)
729 Use "HTML::Entities" to encode a utf8 string
730
731 safe_encode ($string)
732 Use either a regexp (perl < 5.8) or "Encode" to encode non-
733 ascii characters in the string in "&#<nnnn>;" format
734
735 safe_encode_hex ($string)
736 Use either a regexp (perl < 5.8) or "Encode" to encode non-
737 ascii characters in the string in "&#x<nnnn>;" format
738
739 regexp2latin1 ($string)
740 Use a regexp to encode a utf8 string into latin 1
741 (ISO-8859-1). Does not work with Perl 5.8.0!
742
743 output_text_filter
744 same as output_filter, except it doesn't apply to the brackets
745 and quotes around attribute values. This is useful for all
746 filters that could change the tagging, basically anything that
747 does not just change the encoding of the output. "html", "safe"
748 and "safe_hex" are better used with this option.
749
750 input_filter
751 This option is similar to "output_filter" except the filter is
752 applied to the characters before they are stored in the twig,
753 at parsing time.
754
755 remove_cdata
756 Setting this option to a true value will force the twig to
757 output CDATA sections as regular (escaped) PCDATA
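
The effect of "remove_cdata" can be compared with the default output on a one-line document. This is a sketch under the assumption that the sample document below is representative; the variable names are made up.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<doc><![CDATA[a < b]]></doc>';

# default: the CDATA section is output as a CDATA section
my $kept    = XML::Twig->new->parse( $xml)->sprint;

# remove_cdata: the same text is output as escaped PCDATA
my $removed = XML::Twig->new( remove_cdata => 1)->parse( $xml)->sprint;

print "$kept\n$removed\n";
```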
758
759 parse_start_tag
760 If you use the "keep_encoding" option then this option can be
761 used to replace the default parsing function. You should
762 provide a coderef (a reference to a subroutine) as the
763 argument, this subroutine takes the original tag (given by
764 XML::Parser::Expat "original_string()" method) and returns a
765 tag and the attributes in a hash (or in a list
766 attribute_name/attribute value).
767
768 expand_external_ents
When this option is used external entities (that are defined)
are expanded when the document is output using "print"
functions such as "print", "sprint", "flush" and
"xml_string". Note that in the twig the entity will be stored as an
element with a tag "#ENT", the entity will not be expanded
there, so you might want to process the entities before
outputting it.
776
777 If an external entity is not available, then the parse will
778 fail.
779
780 A special case is when the value of this option is -1. In that
781 case a missing entity will not cause the parser to die, but its
782 "name", "sysid" and "pubid" will be stored in the twig as
783 "$twig->{twig_missing_system_entities}" (a reference to an
784 array of hashes { name => <name>, sysid => <sysid>, pubid =>
785 <pubid> }). Yes, this is a bit of a hack, but it's useful in
786 some cases.
787
788 load_DTD
789 If this argument is set to a true value, "parse" or "parsefile"
790 on the twig will load the DTD information. This information
791 can then be accessed through the twig, in a "DTD_handler" for
792 example. This will load even an external DTD.
793
794 Default and fixed values for attributes will also be filled,
795 based on the DTD.
796
797 Note that to do this the module will generate a temporary file
798 in the current directory. If this is a problem let me know and
799 I will add an option to specify an alternate directory.
800
801 See "DTD Handling" for more information
802
803 DTD_handler
804 Set a handler that will be called once the doctype (and the
805 DTD) have been loaded, with 2 arguments, the twig and the DTD.
806
807 no_prolog
808 Does not output a prolog (XML declaration and DTD)
809
810 id This optional argument gives the name of an attribute that can
811 be used as an ID in the document. Elements whose ID is known
812 can be accessed through the elt_id method. id defaults to 'id'.
813 See "BUGS".
814
815 discard_spaces
816 If this optional argument is set to a true value then spaces
817 are discarded when they look non-significant: strings
818 containing only spaces and at least one line feed are
819 discarded. This argument is set to true by default.
820
821 The exact algorithm to drop spaces is: strings including only
822 spaces (perl \s) and at least one \n right before an open or
823 close tag are dropped.
824
825 discard_all_spaces
826 If this argument is set to a true value, spaces are discarded
827 more aggressively than with "discard_spaces": strings not
828 including a \n are also dropped. This option is appropriate for
829 data-oriented XML.
830
831 keep_spaces
832 If this optional argument is set to a true value then all
833 spaces in the document are kept, and stored as "PCDATA".
834
835 Warning: adding this option can result in changes in the twig
836 generated: space that was previously discarded might end up in
837 a new text element. See the difference by calling the following
838 code with 0 and 1 as arguments:
839
840 perl -MXML::Twig -e'print XML::Twig->new( keep_spaces => shift)->parse( "<d> \n<e/></d>")->_dump'
841
842 "keep_spaces" and "discard_spaces" cannot both be set.
843
844 discard_spaces_in
845 This argument sets "keep_spaces" to true but will cause the
846 twig builder to discard spaces in the elements listed.
847
848 The syntax for using this argument is:
849
850 XML::Twig->new( discard_spaces_in => [ 'elt1', 'elt2']);
851
852 keep_spaces_in
853 This argument sets "discard_spaces" to true but will cause the
854 twig builder to keep spaces in the elements listed.
855
856 The syntax for using this argument is:
857
858 XML::Twig->new( keep_spaces_in => [ 'elt1', 'elt2']);
859
860 Warning: adding this option can result in changes in the twig
861 generated: space that was previously discarded might end up in
862 a new text element.
863
864 pretty_print
865 Set the pretty print method, amongst '"none"' (default),
866 '"nsgmls"', '"nice"', '"indented"', '"indented_c"',
867 '"indented_a"', '"indented_close_tag"', '"cvs"', '"wrapped"',
868 '"record"' and '"record_c"'
869
870 pretty_print formats:
871
872 none
873 The document is output as one long string, with no line
874 breaks except those found within text elements
875
876 nsgmls
877 Line breaks are inserted in safe places: that is within
878 tags, between a tag and an attribute, between attributes
879 and before the > at the end of a tag.
880
881 This is quite ugly but better than "none", and it is very
882 safe, the document will still be valid (conforming to its
883 DTD).
884
885 This is how the SGML parser "sgmls" splits documents, hence
886 the name.
887
888 nice
889 This option inserts line breaks before any tag that does
890 not contain text (so elements with textual content are not
891 broken, as the \n would be significant there).
892
893 WARNING: this option leaves the document well-formed but
894 might make it invalid (not conformant to its DTD). If you
895 have elements declared as
896
897 <!ELEMENT foo (#PCDATA|bar)>
898
899 then a "foo" element including a "bar" one will be printed
900 as
901
902 <foo>
903 <bar>bar is just pcdata</bar>
904 </foo>
905
906 This is invalid, as the parser will take the line break
907 after the "foo" tag as a sign that the element contains
908 PCDATA, it will then die when it finds the "bar" tag. This
909 may or may not be important for you, but be aware of it!
910
911 indented
912 Same as "nice" (and with the same warning) but indents
913 elements according to their level
914
915 indented_c
916 Same as "indented" but a little more compact: the closing
917 tags are on the same line as the preceding text
918
919 indented_close_tag
920 Same as "indented" except that the closing tag is also
921 indented, to line up with the tags within the element
922
923 indented_a
924 This formats XML files in a line-oriented version control
925 friendly way. The format is described in
926 <http://tinyurl.com/2kwscq> (that's an Oracle document with
927 an insanely long URL).
928
929 Note that to be totally conformant to the "spec", the order
930 of attributes should not be changed, so if they are not
931 already in alphabetical order you will need to use the
932 "keep_atts_order" option.
933
934 cvs Same as "indented_a".
935
936 wrapped
937 Same as "indented_c" but lines are wrapped using
938 Text::Wrap::wrap. The default length for lines is the
939 default for $Text::Wrap::columns, and can be changed by
940 changing that variable.
941
942 record
943 This is a record-oriented pretty print, that displays data
944 in records, one field per line (which looks a LOT like
945 "indented")
946
947 record_c
948 Stands for record compact, one record per line
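As a rough illustration of what a pretty print style does, the sketch below parses a tiny document with pretty_print set to 'indented' and captures the result with "sprint". The sample document and the variable names are made up for this example.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document, only used for this illustration
my $xml = '<doc><title>Hello</title><para>Some text</para></doc>';

# parse returns the twig, so the calls can be chained
my $indented = XML::Twig->new( pretty_print => 'indented')
                        ->parse( $xml)
                        ->sprint;

# each child of doc should now start on its own, indented line
print $indented;
```

Swapping 'indented' for another style name from the list above is an easy way to compare the formats on your own data.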
949
950 empty_tags
951 Set the empty tag display style ('"normal"', '"html"' or
952 '"expand"').
953
954 "normal" outputs an empty tag '"<tag/>"', "html" adds a space
955 '"<tag />"' for elements that can be empty in XHTML and
956 "expand" outputs '"<tag></tag>"'
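The following sketch shows the "normal" and "expand" styles side by side on a one-element document. The document and the %output hash are assumptions made for the example. Each twig is printed right after it is created, because the empty tag style is a global setting (see "set_empty_tag_style").

```perl
use strict;
use warnings;
use XML::Twig;

my %output;
for my $style ( qw( expand normal)) {
    # a fresh twig per style, output immediately after parsing
    $output{$style} = XML::Twig->new( empty_tags => $style)
                               ->parse( '<doc><e/></doc>')
                               ->sprint;
}

# show how the empty element is written in each style
print "$_: $output{$_}\n" for sort keys %output;
```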
957
958 quote
959 Set the quote character for attributes ('"single"' or
960 '"double"').
961
962 escape_gt
963 By default XML::Twig does not escape the character > in its
964 output, as it is not mandated by the XML spec. With this option
965 on, > will be replaced by "&gt;"
966
967 comments
968 Set the way comments are processed: '"drop"' (default),
969 '"keep"' or '"process"'
970
971 Comments processing options:
972
973 drop
974 drops the comments, they are not read, nor printed to the
975 output
976
977 keep
978 comments are loaded and will appear on the output, they are
979 not accessible within the twig and will not interfere with
980 processing though
981
982 Note: comments in the middle of a text element such as
983
984 <p>text <!-- comment --> more text</p>
985
986 are kept at their original position in the text. Using
987 "print" methods like "print" or "sprint" will return the
988 comments in the text. Using "text" or "field" on the other
989 hand will not.
990
991 Any use of "set_pcdata" on the "#PCDATA" element (directly
992 or through other methods like "set_content") will delete
993 the comment(s).
994
995 process
996 comments are loaded in the twig and will be treated as
997 regular elements (their "tag" is "#COMMENT"). This can
998 interfere with processing if you expect
999 "$elt->{first_child}" to be an element but find a comment
1000 there. Validation will not protect you from this as
1001 comments can happen anywhere. You can use
1002 "$elt->first_child( 'tag')" (which is a good habit anyway)
1003 to get where you want.
1004
1005 Consider using "process" if you are outputting SAX events
1006 from XML::Twig.
1007
1008 pi Set the way processing instructions are processed: '"drop"',
1009 '"keep"' (default) or '"process"'
1010
1011 Note that you can also set PI handlers in the "twig_handlers"
1012 option:
1013
1014 '?' => \&handler
1015 '?target' => \&handler2
1016
1017 The handlers will be called with 2 parameters, the twig and the
1018 PI element if "pi" is set to "process", and with 3, the twig,
1019 the target and the data if "pi" is set to "keep". Of course
1020 they will not be called if "pi" is set to "drop".
1021
1022 If "pi" is set to "keep" the handler should return a string
1023 that will be used as-is as the PI text (it should look like
1024 "<?target data?>", or '' if you want to remove the PI).
1025
1026 Only one handler will be called, "?target" or "?" if no
1027 specific handler for that target is available.
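A minimal sketch of a catch-all PI handler, assuming "pi" is set to "process" so that the handler receives the twig and the PI element. The sample document, the "app" target and the @seen array are invented for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;    # targets of the processing instructions found

my $t = XML::Twig->new(
    pi            => 'process',
    twig_handlers => {
        # '?' catches any PI that has no specific '?target' handler;
        # with pi => 'process' the arguments are the twig and the PI element
        '?' => sub { my( $twig, $pi)= @_; push @seen, $pi->target; },
    },
);

$t->parse( '<doc><?app do something?><p>text</p></doc>');
print "PI targets: @seen\n";
```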
1028
1029 map_xmlns
1030 This option is passed a hashref that maps uri's to prefixes.
1031 The prefixes in the document will be replaced by the ones in
1032 the map. The mapped prefixes can (actually have to) be used to
1033 trigger handlers, navigate or query the document.
1034
1035 Here is an example:
1036
1037 my $t= XML::Twig->new( map_xmlns => {'http://www.w3.org/2000/svg' => "svg"},
1038 twig_handlers =>
1039 { 'svg:circle' => sub { $_->set_att( r => 20) } },
1040 pretty_print => 'indented',
1041 )
1042 ->parse( '<doc xmlns:gr="http://www.w3.org/2000/svg">
1043 <gr:circle cx="10" cy="90" r="10"/>
1044 </doc>'
1045 )
1046 ->print;
1047
1048 This will output:
1049
1050 <doc xmlns:svg="http://www.w3.org/2000/svg">
1051 <svg:circle cx="10" cy="90" r="20"/>
1052 </doc>
1053
1054 keep_original_prefix
1055 When used with "map_xmlns" this option will make "XML::Twig"
1056 use the original namespace prefixes when outputting a document.
1057 The mapped prefix will still be used for triggering handlers
1058 and in navigation and query methods.
1059
1060 my $t= XML::Twig->new( map_xmlns => {'http://www.w3.org/2000/svg' => "svg"},
1061 twig_handlers =>
1062 { 'svg:circle' => sub { $_->set_att( r => 20) } },
1063 keep_original_prefix => 1,
1064 pretty_print => 'indented',
1065 )
1066 ->parse( '<doc xmlns:gr="http://www.w3.org/2000/svg">
1067 <gr:circle cx="10" cy="90" r="10"/>
1068 </doc>'
1069 )
1070 ->print;
1071
1072 This will output:
1073
1074 <doc xmlns:gr="http://www.w3.org/2000/svg">
1075 <gr:circle cx="10" cy="90" r="20"/>
1076 </doc>
1077
1078 original_uri ($prefix)
1079 called within a handler, this will return the uri bound to the
1080 namespace prefix in the original document.
1081
1082 index ($arrayref or $hashref)
1083 This option creates lists of specific elements during the
1084 parsing of the XML. It takes a reference to either a list of
1085 triggering expressions or to a hash name => expression, and for
1086 each one generates the list of elements that match the
1087 expression. The list can be accessed through the "index"
1088 method.
1089
1090 example:
1091
1092 # using an array ref
1093 my $t= XML::Twig->new( index => [ 'div', 'table' ])
1094 ->parsefile( "foo.xml");
1095 my $divs= $t->index( 'div');
1096 my $first_div= $divs->[0];
1097 my $last_table= $t->index( table => -1);
1098
1099 # using a hashref to name the indexes
1100 my $t= XML::Twig->new( index => { email => 'a[@href=~/^\s*mailto:/]'})
1101 ->parsefile( "foo.xml");
1102 my $last_emails= $t->index( email => -1);
1103
1104 Note that the index is not maintained after the parsing. If
1105 elements are deleted, renamed or otherwise hurt during
1106 processing, the index is NOT updated. (changing the id element
1107 OTOH will update the index)
1108
1109 att_accessors <list of attribute names>
1110 creates methods that give direct access to attributes:
1111
1112 my $t= XML::Twig->new( att_accessors => [ 'href', 'src'])
1113 ->parsefile( $file);
1114 my $first_src= $t->first_elt( 'img')->src; # same as ->att( 'src')
1115 $t->first_elt( 'img')->src( 'new_logo.png'); # changes the attribute value
1116
1117 elt_accessors
1118 creates methods that give direct access to the first child
1119 element (in scalar context) or the list of elements (in list
1120 context):
1121
1122 The list of accessors to create can be given in 2 different
1123 ways: in an array, or in a hash alias => expression
1124 my $t= XML::Twig->new( elt_accessors => [ 'head'])
1125 ->parsefile( $file);
1126 my $title_text= $t->root->head->field( 'title');
1127 # same as $title_text= $t->root->first_child( 'head')->field( 'title');
1129
1130 my $t= XML::Twig->new( elt_accessors => { warnings => 'p[@class="warning"]', d2 => 'div[2]'}, )
1131 ->parsefile( $file);
1132 my $body= $t->first_elt( 'body');
1133 my @warnings= $body->warnings; # same as $body->children( 'p[@class="warning"]');
1134 my $s2= $body->d2; # same as $body->first_child( 'div[2]')
1135
1136 field_accessors
1137 creates methods that give direct access to the first child
1138 element text:
1139
1140 my $t= XML::Twig->new( field_accessors => [ 'title'])
1141 ->parsefile( $file);
1142 my $div_title_text= $t->first_elt( 'div')->title;
1143 # same as $div_title_text= $t->first_elt( 'div')->field( 'title');
1144
1145 use_tidy
1146 set this option to use HTML::Tidy instead of HTML::TreeBuilder
1147 to convert HTML to XML. HTML found in the wild is often badly
1148 formed, so depending on the data, one module or the other does
1149 a better job at the conversion. Also, HTML::Tidy can be a bit
1150 difficult to install, so XML::Twig offers both options.
1151 TIMTOWTDI
1152
1153 output_html_doctype
1154 when using HTML::TreeBuilder to convert HTML, this option
1155 causes the DOCTYPE declaration to be output, which may be
1156 important for some legacy browsers. Without that option the
1157 DOCTYPE definition is NOT output. Also if the definition is
1158 completely wrong (ie not easily parsable), it is not output
1159 either.
1160
1161 Note: I _HATE_ the Java-like name of arguments used by most XML
1162 modules. So in pure TIMTOWTDI fashion all arguments can be written
1163 either as "UglyJavaLikeName" or as "readable_perl_name":
1164 "twig_print_outside_roots" or "TwigPrintOutsideRoots" (or even
1165 "twigPrintOutsideRoots" {shudder}). XML::Twig normalizes them
1166 before processing them.
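To illustrate the note above, here is a small sketch that passes the same option once in each spelling and checks that both twigs call their handler. The sample document and the counters are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = '<doc><para>one</para><para>two</para></doc>';
my( $count_perl_style, $count_java_style)= ( 0, 0);

# readable_perl_name spelling
XML::Twig->new( twig_handlers => { para => sub { $count_perl_style++ } })
         ->parse( $xml);

# UglyJavaLikeName spelling of the same option
XML::Twig->new( TwigHandlers => { para => sub { $count_java_style++ } })
         ->parse( $xml);

print "perl style: $count_perl_style, java style: $count_java_style\n";
```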
1167
1168 parse ( $source)
1169 The $source parameter should either be a string containing the
1170 whole XML document, or it should be an open "IO::Handle" (aka a
1171 filehandle).
1172
1173 A die call is thrown if a parse error occurs. Otherwise it will
1174 return the twig built by the parse. Use "safe_parse" if you want
1175 the parsing to return even when an error occurs.
1176
1177 If this method is called as a class method ("XML::Twig->parse(
1178 $some_xml_or_html)") then an XML::Twig object is created, using the
1179 parameters except the last one (eg "XML::Twig->parse( pretty_print
1180 => 'indented', $some_xml_or_html)") and "xparse" is called on it.
1181
1182 Note that when parsing a filehandle, the handle should NOT be open
1183 with an encoding (ie open with "open( my $in, '<', $filename)"). The
1184 file will be parsed by "expat", so specifying the encoding actually
1185 causes problems for the parser (as in: it can crash it, see
1186 https://rt.cpan.org/Ticket/Display.html?id=78877). For parsing a
1187 file it is actually recommended to use "parsefile" on the file
1188 name, instead of "parse" on the open file.
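A short sketch of both outcomes of "parse": a successful parse of a string, and a parse error caught with a plain "eval". The documents and variable names are made up for the illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# successful parse of a string holding the whole document
my $twig = XML::Twig->new;
$twig->parse( '<doc><para>hello</para></doc>');
my $text = $twig->root->first_child( 'para')->text;
print "para text: $text\n";

# a document that is not well-formed makes parse die,
# so the call is wrapped in eval to catch the error
my $ok    = eval { XML::Twig->new->parse( '<doc><para>unclosed</doc>'); 1 };
my $error = $@;
print "parse failed: $error\n" unless $ok;
```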
1189
1190 parsestring
1191 This is just an alias for "parse" for backwards compatibility.
1192
1193 parsefile (FILE [, OPT => OPT_VALUE [...]])
1194 Open "FILE" for reading, then call "parse" with the open handle.
1195 The file is closed no matter how "parse" returns.
1196
1197 A "die" call is thrown if a parse error occurs. Otherwise it will
1198 return the twig built by the parse. Use "safe_parsefile" if you
1199 want the parsing to return even when an error occurs.
1200
1201 parsefile_inplace ( $file, $optional_extension)
1202 Parse and update a file "in place". It does this by creating a temp
1203 file, selecting it as the default for print() statements (and
1204 methods), then parsing the input file. If the parsing is
1205 successful, then the temp file is moved to replace the input file.
1206
1207 If an extension is given then the original file is backed-up (the
1208 rules for the extension are the same as the rule for the -i option
1209 in perl).
1210
1211 parsefile_html_inplace ( $file, $optional_extension)
1212 Same as parsefile_inplace, except that it parses HTML instead of
1213 XML
1214
1215 parseurl ($url, $optional_user_agent)
1216 Gets the data from $url and parses it. The data is piped to the
1217 parser in chunks the size of the XML::Parser::Expat buffer, so
1218 memory consumption and hopefully speed are optimal.
1219
1220 For most (read "small") XML it is probably as efficient (and easier
1221 to debug) to just "get" the XML file and then parse it as a string.
1222
1223 use XML::Twig;
1224 use LWP::Simple;
1225 my $twig= XML::Twig->new();
1226 $twig->parse( LWP::Simple::get( $URL ));
1227
1228 or
1229
1230 use XML::Twig;
1231 my $twig= XML::Twig->nparse( $URL);
1232
1233 If the $optional_user_agent argument is given then that user agent
1234 is used, otherwise a new one is created.
1235
1236 safe_parse ( SOURCE [, OPT => OPT_VALUE [...]])
1237 This method is similar to "parse" except that it wraps the parsing
1238 in an "eval" block. It returns the twig on success and 0 on failure
1239 (the twig object also contains the parsed twig). $@ contains the
1240 error message on failure.
1241
1242 Note that the parsing still stops as soon as an error is detected,
1243 there is no way to keep going after an error.
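A minimal sketch of the "safe_parse" calling pattern, relying only on the documented behaviour that the result is true on success and false on failure. Both sample documents are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# well-formed document: the twig is returned
my $good = XML::Twig->new->safe_parse( '<doc><item/></doc>');

# item is never closed: the call returns a false value instead of dying
my $bad = XML::Twig->new->safe_parse( '<doc><item></doc>');

print "first document parsed\n" if $good;
print "second document failed: $@\n" unless $bad;
```

Testing the return value rather than $@ alone keeps the check independent of whatever else may have touched $@ in between.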
1244
1245 safe_parsefile (FILE [, OPT => OPT_VALUE [...]])
1246 This method is similar to "parsefile" except that it wraps the
1247 parsing in an "eval" block. It returns the twig on success and 0 on
1248 failure (the twig object also contains the parsed twig). $@
1249 contains the error message on failure.
1250
1251 Note that the parsing still stops as soon as an error is detected,
1252 there is no way to keep going after an error.
1253
1254 safe_parseurl ($url, $optional_user_agent)
1255 Same as "parseurl" except that it wraps the parsing in an "eval"
1256 block. It returns the twig on success and 0 on failure (the twig
1257 object also contains the parsed twig). $@ contains the error
1258 message on failure.
1259
1260 parse_html ($string_or_fh)
1261 parse an HTML string or file handle (by converting it to XML using
1262 HTML::TreeBuilder, which needs to be available).
1263
1264 This works nicely, but some information gets lost in the process:
1265 newlines are removed, and (at least on the version I use), comments
1266 get an extra CDATA section inside ( <!-- foo --> becomes <!--
1267 <![CDATA[ foo ]]> --> ).
1268
1269 parsefile_html ($file)
1270 parse an HTML file (by converting it to XML using
1271 HTML::TreeBuilder, which needs to be available, or HTML::Tidy if
1272 the "use_tidy" option was used). The file is loaded completely in
1273 memory and converted to XML before being parsed.
1274
1275 this method is to be used with caution though, as it doesn't know
1276 about the file encoding, it is usually better to use "parse_html",
1277 which gives you a chance to open the file with the proper encoding
1278 layer.
1279
1280 parseurl_html ($url, $optional_user_agent)
1281 parse a URL as HTML the same way "parse_html" does
1282
1283 safe_parseurl_html ($url, $optional_user_agent)
1284 Same as "parseurl_html" except that it wraps the parsing in an
1285 "eval" block. It returns the twig on success and 0 on failure (the
1286 twig object also contains the parsed twig). $@ contains the error
1287 message on failure.
1288
1289 safe_parsefile_html ($file $optional_user_agent)
1290 Same as "parsefile_html" except that it wraps the parsing in an
1291 "eval" block. It returns the twig on success and 0 on failure (the
1292 twig object also contains the parsed twig). $@ contains the error
1293 message on failure.
1294
1295 safe_parse_html ($string_or_fh)
1296 Same as "parse_html" except that it wraps the parsing in an "eval"
1297 block. It returns the twig on success and 0 on failure (the twig
1298 object also contains the parsed twig). $@ contains the error
1299 message on failure.
1300
1301 xparse ($thing_to_parse)
1302 parse the $thing_to_parse, whether it is a filehandle, a string, an
1303 HTML file, an HTML URL, an URL or a file.
1304
1305 Note that this is mostly a convenience method for one-off scripts.
1306 For example files that end in '.htm' or '.html' are parsed first as
1307 XML, and if this fails as HTML. This is certainly not the most
1308 efficient way to do this in general.
1309
1310 nparse ($optional_twig_options, $thing_to_parse)
1311 create a twig with the $optional_twig_options, and parse the
1312 $thing_to_parse, whether it is a filehandle, a string, an HTML
1313 file, an HTML URL, an URL or a file.
1314
1315 Examples:
1316
1317 XML::Twig->nparse( "file.xml");
1318 XML::Twig->nparse( error_context => 1, "file://file.xml");
1319
1320 nparse_pp ($optional_twig_options, $thing_to_parse)
1321 same as "nparse" but also sets the "pretty_print" option to
1322 "indented".
1323
1324 nparse_e ($optional_twig_options, $thing_to_parse)
1325 same as "nparse" but also sets the "error_context" option to 1.
1326
1327 nparse_ppe ($optional_twig_options, $thing_to_parse)
1328 same as "nparse" but also sets the "pretty_print" option to
1329 "indented" and the "error_context" option to 1.
1330
1331 parser
1332 This method returns the "expat" object (actually the
1333 XML::Parser::Expat object) used during parsing. It is useful for
1334 example to call XML::Parser::Expat methods on it. To get the line
1335 of a tag for example use "$t->parser->current_line".
1336
1337 setTwigHandlers ($handlers)
1338 Set the twig_handlers. $handlers is a reference to a hash similar
1339 to the one in the "twig_handlers" option of new. All previous
1340 handlers are unset. The method returns the reference to the
1341 previous handlers.
1342
1343 setTwigHandler ($exp, $handler)
1344 Set a single twig_handler for elements matching $exp. $handler is a
1345 reference to a subroutine. If the handler was previously set then
1346 the reference to the previous handler is returned.
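A small sketch of setting a handler after the twig has been created, instead of passing "twig_handlers" to "new". The sample document and the @titles array are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my @titles;

my $t = XML::Twig->new;

# attach a single handler for title elements before parsing starts;
# $_ is set to the element inside the handler
$t->setTwigHandler( title => sub { push @titles, $_->text });

$t->parse( '<doc><title>One</title><title>Two</title></doc>');
print "titles: @titles\n";
```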
1347
1348 setStartTagHandlers ($handlers)
1349 Set the start_tag handlers. $handlers is a reference to a hash
1350 similar to the one in the "start_tag_handlers" option of new. All
1351 previous handlers are unset. The method returns the reference to
1352 the previous handlers.
1353
1354 setStartTagHandler ($exp, $handler)
1355 Set a single start_tag handler for elements matching $exp.
1356 $handler is a reference to a subroutine. If the handler was
1357 previously set then the reference to the previous handler is
1358 returned.
1359
1360 setEndTagHandlers ($handlers)
1361 Set the end_tag handlers. $handlers is a reference to a hash
1362 similar to the one in the "end_tag_handlers" option of new. All
1363 previous handlers are unset. The method returns the reference to
1364 the previous handlers.
1365
1366 setEndTagHandler ($exp, $handler)
1367 Set a single end_tag handler for elements matching $exp. $handler
1368 is a reference to a subroutine. If the handler was previously set
1369 then the reference to the previous handler is returned.
1370
1371 setTwigRoots ($handlers)
1372 Same as using the "twig_roots" option when creating the twig
1373
1374 setCharHandler ($exp, $handler)
1375 Set a "char_handler"
1376
1377 setIgnoreEltsHandler ($exp)
1378 Set an "ignore_elt" handler (elements that match $exp will be
1379 ignored)
1380
1381 setIgnoreEltsHandlers ($exp)
1382 Set all "ignore_elt" handlers (previous handlers are replaced)
1383
1384 dtd Return the dtd (an XML::Twig::DTD object) of a twig
1385
1386 xmldecl
1387 Return the XML declaration for the document, or a default one if it
1388 doesn't have one
1389
1390 doctype
1391 Return the doctype for the document
1392
1393 doctype_name
1394 returns the doctype of the document from the doctype declaration
1395
1396 system_id
1397 returns the system value of the DTD of the document from the
1398 doctype declaration
1399
1400 public_id
1401 returns the public doctype of the document from the doctype
1402 declaration
1403
1404 internal_subset
1405 returns the internal subset of the DTD
1406
1407 dtd_text
1408 Return the DTD text
1409
1410 dtd_print
1411 Print the DTD
1412
1413 model ($tag)
1414 Return the model (in the DTD) for the element $tag
1415
1416 root
1417 Return the root element of a twig
1418
1419 set_root ($elt)
1420 Set the root of a twig
1421
1422 first_elt ($optional_condition)
1423 Return the first element matching $optional_condition of a twig, if
1424 no condition is given then the root is returned
1425
1426 last_elt ($optional_condition)
1427 Return the last element matching $optional_condition of a twig, if
1428 no condition is given then the last element of the twig is returned
1429
1430 elt_id ($id)
1431 Return the element whose "id" attribute is $id
1432
1433 getEltById
1434 Same as "elt_id"
1435
1436 index ($index_name, $optional_index)
1437 If the $optional_index argument is present, return the
1438 corresponding element in the index (created using the "index"
1439 option for "XML::Twig->new")
1440
1441 If the argument is not present, return an arrayref to the index
1442
1443 normalize
1444 merge together all consecutive pcdata elements in the document (if
1445 for example you have turned some elements into pcdata using
1446 "erase", this will give you a "clean" document in which all
1447 text elements are as long as possible).
1448
1449 encoding
1450 This method returns the encoding of the XML document, as defined by
1451 the "encoding" attribute in the XML declaration (ie it is "undef"
1452 if the attribute is not defined)
1453
1454 set_encoding
1455 This method sets the value of the "encoding" attribute in the XML
1456 declaration. Note that if the document did not have a declaration
1457 it is generated (with an XML version of 1.0)
1458
1459 xml_version
1460 This method returns the XML version, as defined by the "version"
1461 attribute in the XML declaration (ie it is "undef" if the attribute
1462 is not defined)
1463
1464 set_xml_version
1465 This method sets the value of the "version" attribute in the XML
1466 declaration. If the declaration did not exist it is created.
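The accessors above can be tried on a document that has no XML declaration at all. This is only a sketch: the one-element document and the variable names are made up, and it relies on the documented behaviour that a missing declaration is generated with version 1.0.

```perl
use strict;
use warnings;
use XML::Twig;

# no XML declaration in the input
my $t = XML::Twig->new->parse( '<doc/>');

my $before = $t->encoding;          # nothing declared yet

# setting the encoding generates the declaration
$t->set_encoding( 'ISO-8859-1');

my $after   = $t->encoding;
my $version = $t->xml_version;

print "encoding before: ", ( defined $before ? $before : 'undef'), "\n";
print "encoding after:  $after, version: $version\n";
```

Note that this only changes the declaration; it does not by itself convert the characters that are output.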
1467
1468 standalone
1469 This method returns the value of the "standalone" declaration for
1470 the document
1471
1472 set_standalone
1473 This method sets the value of the "standalone" attribute in the XML
1474 declaration. Note that if the document did not have a declaration
1475 it is generated (with an XML version of 1.0)
1476
1477 set_output_encoding
1478 Set the "encoding" "attribute" in the XML declaration
1479
1480 set_doctype ($name, $system, $public, $internal)
1481 Set the doctype of the document. If an argument is "undef" (or not
1482 present) then its former value is retained, if a false ('' or 0)
1483 value is passed then the former value is deleted.
1484
1485 entity_list
1486 Return the entity list of a twig
1487
1488 entity_names
1489 Return the list of all defined entities
1490
1491 entity ($entity_name)
1492 Return the entity
1493
1494 change_gi ($old_gi, $new_gi)
1495 Performs a (very fast) global change. All elements $old_gi are now
1496 $new_gi. This is a bit dangerous though and should be avoided if
1497 possible, as the new tag might be ignored in subsequent processing.
1498
1499 See "BUGS".
1500
1501 flush ($optional_filehandle, %options)
1502 Flushes a twig up to (and including) the current element, then
1503 deletes all unnecessary elements from the tree that's kept in
1504 memory. "flush" keeps track of which elements need to be
1505 open/closed, so if you flush from handlers you don't have to worry
1506 about anything. Just keep flushing the twig every time you're done
1507 with a sub-tree and it will come out well-formed. After the whole
1508 parsing don't forget to "flush" one more time to print the end of
1509 the document. The doctype and entity declarations are also
1510 printed.
1511
1512 flush takes an optional filehandle as an argument.
1513
1514 If you use "flush" at any point during parsing, the document will
1515 be flushed one last time at the end of the parsing, to the proper
1516 filehandle.
1517
1518 options: use the "update_DTD" option if you have updated the
1519 (internal) DTD and/or the entity list and you want the updated DTD
1520 to be output
1521
1522 The "pretty_print" option sets the pretty printing of the document.
1523
1524 Example: $t->flush( Update_DTD => 1);
1525 $t->flush( $filehandle, pretty_print => 'indented');
1526 $t->flush( \*FILE);
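Here is a minimal sketch of flushing from a handler to a filehandle. To keep it self-contained the filehandle is opened on an in-memory scalar, which is an assumption of this example (a regular file works the same way). The final flush is left to the twig, as described above.

```perl
use strict;
use warnings;
use XML::Twig;

# in-memory filehandle standing in for a real output file
my $buffer = '';
open( my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";

my $t = XML::Twig->new(
    twig_handlers => {
        # output everything up to and including this item, then free it
        item => sub { $_[0]->flush( $out); },
    },
);

$t->parse( '<list><item>a</item><item>b</item></list>');
close $out;

print "flushed output: $buffer\n";
```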
1527
1528 flush_up_to ($elt, $optional_filehandle, %options)
1529 Flushes up to the $elt element. This allows you to keep part of the
1530 tree in memory when you "flush".
1531
1532 options: see flush.
1533
1534 purge
1535 Does the same as a "flush" except it does not print the twig. It
1536 just deletes all elements that have been completely parsed so far.
1537
1538 purge_up_to ($elt)
1539 Purges up to the $elt element. This allows you to keep part of the
1540 tree in memory when you "purge".
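When nothing needs to be output, "purge" can be called from a handler to keep memory use low while still visiting every element. A rough sketch, with a made-up document and counter:

```perl
use strict;
use warnings;
use XML::Twig;

my $count = 0;

my $t = XML::Twig->new(
    twig_handlers => {
        # count the record, then drop what has been parsed so far
        record => sub { $count++; $_[0]->purge; },
    },
);

$t->parse( '<db><record id="1"/><record id="2"/><record id="3"/></db>');
print "records seen: $count\n";
```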
1541
1542 print ($optional_filehandle, %options)
1543 Prints the whole document associated with the twig. To be used only
1544 AFTER the parse.
1545
1546 options: see "flush".
1547
1548 print_to_file ($filename, %options)
1549 Prints the whole document associated with the twig to file
1550 $filename. To be used only AFTER the parse.
1551
1552 options: see "flush".
1553
1554 sprint
1555 Return the text of the whole document associated with the twig. To
1556 be used only AFTER the parse.
1557
1558 options: see "flush".
1559
1560 trim
1561 Trim the document: gets rid of initial and trailing spaces, and
1562 replaces multiple spaces by a single one.
1563
1564 toSAX1 ($handler)
1565 Send SAX events for the twig to the SAX1 handler $handler
1566
1567 toSAX2 ($handler)
1568 Send SAX events for the twig to the SAX2 handler $handler
1569
1570 flush_toSAX1 ($handler)
1571 Same as flush, except that SAX events are sent to the SAX1 handler
1572 $handler instead of the twig being printed
1573
1574 flush_toSAX2 ($handler)
1575 Same as flush, except that SAX events are sent to the SAX2 handler
1576 $handler instead of the twig being printed
1577
1578 ignore
1579 This method should be called during parsing, usually in
1580 "start_tag_handlers". It causes the element to be skipped during
1581 the parsing: the twig is not built for this element, it will not be
1582 accessible during parsing or after it. The element will not take up
1583 any memory and parsing will be faster.
1584
1585 Note that this method can also be called on an element. If the
1586 element is a parent of the current element then this element will
1587 be ignored (the twig will not be built any more for it and what has
1588 already been built will be deleted).
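A minimal sketch of calling "ignore" from a start tag handler, so that the element never makes it into the twig. The element names and the sample document are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $t = XML::Twig->new(
    start_tag_handlers => {
        # skip private elements entirely: no twig is built for them
        private => sub { $_[0]->ignore; },
    },
);

$t->parse( '<doc><public>visible</public><private>secret</private></doc>');

my $public  = $t->root->first_child( 'public');
my $private = $t->root->first_child( 'private');

print "public text: ", $public->text, "\n";
print "private element is ", ( $private ? 'present' : 'absent'), "\n";
```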
1589
1590 set_pretty_print ($style)
1591 Set the pretty print method, amongst '"none"' (default),
1592 '"nsgmls"', '"nice"', '"indented"', "indented_c", '"wrapped"',
1593 '"record"' and '"record_c"'
1594
1595 WARNING: the pretty print style is a GLOBAL variable, so once set
1596 it's applied to ALL "print"'s (and "sprint"'s). Same goes if you
1597 use XML::Twig with "mod_perl" . This should not be a problem as the
1598 XML that's generated is valid anyway, and XML processors (as well
1599 as HTML processors, including browsers) should not care. Let me
1600 know if this is a big problem, but at the moment the
1601 performance/cleanliness trade-off clearly favors the global
1602 approach.
1603
1604 set_empty_tag_style ($style)
1605 Set the empty tag display style ('"normal"', '"html"' or
1606 '"expand"'). As with "set_pretty_print" this sets a global flag.
1607
1608 "normal" outputs an empty tag '"<tag/>"', "html" adds a space
1609 '"<tag />"' for elements that can be empty in XHTML and "expand"
1610 outputs '"<tag></tag>"'
1611
1612 set_remove_cdata ($flag)
1613 set (or unset) the flag that forces the twig to output CDATA
1614 sections as regular (escaped) PCDATA
1615
1616 print_prolog ($optional_filehandle, %options)
1617 Prints the prolog (XML declaration + DTD + entity declarations) of
1618 a document.
1619
1620 options: see "flush".
1621
1622 prolog ($optional_filehandle, %options)
1623 Return the prolog (XML declaration + DTD + entity declarations) of
1624 a document.
1625
1626 options: see "flush".
1627
1628 finish
1629 Call Expat "finish" method. Unsets all handlers (including
1630 internal ones that set context), but expat continues parsing to the
1631 end of the document or until it finds an error. It should finish
1632 up a lot faster than with the handlers set.
1633
1634 finish_print
1635 Stops twig processing, flushes the twig and proceeds to finish
1636 printing the document as fast as possible. Use this method when
1637 modifying a document and the modification is done.
1638
1639 finish_now
Stops twig processing, does not finish parsing the document (which
could actually be not well-formed after the point where
"finish_now" is called). Execution resumes after the "parse" or
"parsefile" call. The content of the twig is what has been parsed
so far (all open elements at the time "finish_now" is called are
considered closed).
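
The behaviour described above can be sketched as follows. The document, the "item" tag and the $seen counter are assumptions made for the example; the handler stops the parse once it has seen two items.

```perl
use strict;
use warnings;
use XML::Twig;

my $seen = 0;
my $twig = XML::Twig->new(
    twig_handlers =>
      { # a handler receives the twig, then the element
        item => sub { my( $t, $item) = @_;
                      $seen++;
                      $t->finish_now if $seen == 2;   # stop after the 2nd item
                    },
      },
);

# parsing stops before the third item is reached
$twig->parse( '<list><item>a</item><item>b</item><item>c</item></list>');

# execution resumes here, with the twig holding what was parsed so far
my @items = $twig->root->children( 'item');
print scalar( @items), " items kept\n";
```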
1646
1647 set_expand_external_entities
1648 Same as using the "expand_external_ents" option when creating the
1649 twig
1650
1651 set_input_filter
1652 Same as using the "input_filter" option when creating the twig
1653
1654 set_keep_atts_order
1655 Same as using the "keep_atts_order" option when creating the twig
1656
1657 set_keep_encoding
1658 Same as using the "keep_encoding" option when creating the twig
1659
1660 escape_gt
Usually XML::Twig does not escape > in its output. Calling this
method makes it replace > by &gt;
1663
1664 do_not_escape_gt
1665 reverts XML::Twig behavior to its default of not escaping > in its
1666 output.
1667
1668 set_output_filter
1669 Same as using the "output_filter" option when creating the twig
1670
1671 set_output_text_filter
1672 Same as using the "output_text_filter" option when creating the
1673 twig
1674
1675 add_stylesheet ($type, @options)
1676 Adds an external stylesheet to an XML document.
1677
1678 Supported types and options:
1679
1680 xsl option: the url of the stylesheet
1681
1682 Example:
1683
1684 $t->add_stylesheet( xsl => "xsl_style.xsl");
1685
1686 will generate the following PI at the beginning of the
1687 document:
1688
1689 <?xml-stylesheet type="text/xsl" href="xsl_style.xsl"?>
1690
1691 css option: the url of the stylesheet
1692
1693 active_twig
1694 a class method that returns the last processed twig, so you
1695 don't necessarily need the object to call methods on it.
1696
1697 Methods inherited from XML::Parser::Expat
1698 A twig inherits all the relevant methods from XML::Parser::Expat.
1699 These methods can only be used during the parsing phase (they will
1700 generate a fatal error otherwise).
1701
1702 Inherited methods are:
1703
1704 depth
1705 Returns the size of the context list.
1706
1707 in_element
Returns true if NAME is equal to the name of the innermost
currently opened element. If namespace processing is being used
1710 and you want to check against a name that may be in a
1711 namespace, then use the generate_ns_name method to create the
1712 NAME argument.
1713
1714 within_element
1715 Returns the number of times the given name appears in the
1716 context list. If namespace processing is being used and you
1717 want to check against a name that may be in a namespace, then
use the generate_ns_name method to create the NAME argument.
1719
1720 context
1721 Returns a list of element names that represent open elements,
1722 with the last one being the innermost. Inside start and end tag
handlers, this will be the tag of the parent element.
1724
1725 current_line
1726 Returns the line number of the current position of the parse.
1727
1728 current_column
1729 Returns the column number of the current position of the parse.
1730
1731 current_byte
1732 Returns the current position of the parse.
1733
1734 position_in_context
1735 Returns a string that shows the current parse position. LINES
1736 should be an integer >= 0 that represents the number of lines
1737 on either side of the current parse line to place into the
1738 returned string.
1739
1740 base ([NEWBASE])
1741 Returns the current value of the base for resolving relative
1742 URIs. If NEWBASE is supplied, changes the base to that value.
1743
1744 current_element
1745 Returns the name of the innermost currently opened element.
1746 Inside start or end handlers, returns the parent of the element
1747 associated with those tags.
1748
1749 element_index
1750 Returns an integer that is the depth-first visit order of the
current element. This will be zero outside of the root
1752 element. For example, this will return 1 when called from the
1753 start handler for the root element start tag.
1754
1755 recognized_string
1756 Returns the string from the document that was recognized in
1757 order to call the current handler. For instance, when called
from a start handler, it will give us the start-tag string.
1759 The string is encoded in UTF-8. This method doesn't return a
1760 meaningful string inside declaration handlers.
1761
1762 original_string
1763 Returns the verbatim string from the document that was
1764 recognized in order to call the current handler. The string is
1765 in the original document encoding. This method doesn't return a
1766 meaningful string inside declaration handlers.
1767
1768 xpcroak
1769 Concatenate onto the given message the current line number
1770 within the XML document plus the message implied by
1771 ErrorContext. Then croak with the formed message.
1772
1773 xpcarp
1774 Concatenate onto the given message the current line number
1775 within the XML document plus the message implied by
1776 ErrorContext. Then carp with the formed message.
1777
1778 xml_escape(TEXT [, CHAR [, CHAR ...]])
1779 Returns TEXT with markup characters turned into character
1780 entities. Any additional characters provided as arguments are
1781 also turned into character references where found in TEXT.
1782
1783 (this method is broken on some versions of expat/XML::Parser)
1784
1785 path ( $optional_tag)
1786 Return the element context in a form similar to XPath's short form:
1787 '"/root/tag1/../tag"'
1788
1789 get_xpath ( $optional_array_ref, $xpath, $optional_offset)
Performs a "get_xpath" on the document root (see "XML::Twig::Elt")

If the $optional_array_ref argument is used the array must contain
elements. The $xpath expression is applied to each element in turn
and the result is the union of all results. This way a first query can
be refined in further steps.
1796
1797 find_nodes ( $optional_array_ref, $xpath, $optional_offset)
1798 same as "get_xpath"
1799
1800 findnodes ( $optional_array_ref, $xpath, $optional_offset)
1801 same as "get_xpath" (similar to the XML::LibXML method)
1802
1803 findvalue ( $optional_array_ref, $xpath, $optional_offset)
1804 Return the "join" of all texts of the results of applying
1805 "get_xpath" to the node (similar to the XML::LibXML method)
1806
1807 findvalues ( $optional_array_ref, $xpath, $optional_offset)
1808 Return an array of all texts of the results of applying "get_xpath"
1809 to the node
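
A short sketch of these query methods called on the twig itself. The document and its "type" attribute are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><item type="fruit">apple</item>'
            . '<item type="veg">carrot</item>'
            . '<item type="fruit">pear</item></doc>');

# elements matching the expression
my @fruits = $twig->get_xpath( '/doc/item[@type="fruit"]');

# findvalue joins the texts, findvalues returns them as a list
my $joined = $twig->findvalue( '/doc/item[@type="fruit"]');
my @texts  = $twig->findvalues( '/doc/item');

print scalar( @fruits), " fruits: $joined\n";
print join( ', ', @texts), "\n";
```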
1810
1811 subs_text ($regexp, $replace)
1812 subs_text does text substitution on the whole document, similar to
1813 perl's " s///" operator.
1814
1815 dispose
1816 Useful only if you don't have "Scalar::Util" or "WeakRef"
1817 installed.
1818
1819 Reclaims properly the memory used by an XML::Twig object. As the
1820 object has circular references it never goes out of scope, so if
1821 you want to parse lots of XML documents then the memory leak
1822 becomes a problem. Use "$twig->dispose" to clear this problem.
1823
1824 att_accessors (list_of_attribute_names)
A convenience method that creates l-valued accessors for
attributes. So "$twig->att_accessors( 'foo')" will create a
"foo" method that can be called on elements:

    $elt->foo;          # equivalent to $elt->att( 'foo');
    $elt->foo( 'bar');  # equivalent to $elt->set_att( foo => 'bar');

The methods are l-valued only under perl versions that support this
feature (5.6 and above)
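
A minimal sketch of the generated accessor in use. The "price" attribute and the catalog document are assumptions made for the example; a name that is not already a method of XML::Twig::Elt is chosen on purpose.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->att_accessors( 'price');      # creates a price method on elements
$twig->parse( '<catalog><book price="30">Perl</book></catalog>');

my $book   = $twig->root->first_child( 'book');
my $before = $book->price;           # read the attribute
$book->price( 25);                   # set it
my $after  = $book->att( 'price');   # the regular way to read it

print "price went from $before to $after\n";
```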
1834
1835 create_accessors (list_of_attribute_names)
1836 Same as att_accessors
1837
1838 elt_accessors (list_of_attribute_names)
A convenience method that creates accessors for elements. So
"$twig->elt_accessors( 'foo')" will create a "foo" method that
can be called on elements:
1842
1843 $elt->foo; # equivalent to $elt->first_child( 'foo');
1844
1845 field_accessors (list_of_attribute_names)
A convenience method that creates accessors for element values
("field"). So "$twig->field_accessors( 'foo')" will create a
"foo" method that can be called on elements:
1849
1850 $elt->foo; # equivalent to $elt->field( 'foo');
1851
1852 set_do_not_escape_amp_in_atts
1853 An evil method, that I only document because Test::Pod::Coverage
complains otherwise, but really, you don't want to know about it.
1855
1856 XML::Twig::Elt
1857 new ($optional_tag, $optional_atts, @optional_content)
The "tag" is optional (but then the element cannot have content), the
$optional_atts argument is a reference to a hash of attributes, and the
content can be just a string or a list of strings and elements. A
content of '"#EMPTY"' creates an empty element.
1862
1863 Examples: my $elt= XML::Twig::Elt->new();
1864 my $elt= XML::Twig::Elt->new( para => { align => 'center' });
1865 my $elt= XML::Twig::Elt->new( para => { align => 'center' }, 'foo');
1866 my $elt= XML::Twig::Elt->new( br => '#EMPTY');
1867 my $elt= XML::Twig::Elt->new( 'para');
1868 my $elt= XML::Twig::Elt->new( para => 'this is a para');
1869 my $elt= XML::Twig::Elt->new( para => $elt3, 'another para');
1870
1871 The strings are not parsed, the element is not attached to any
1872 twig.
1873
1874 WARNING: if you rely on ID's then you will have to set the id
1875 yourself. At this point the element does not belong to a twig yet,
1876 so the ID attribute is not known so it won't be stored in the ID
1877 list.
1878
1879 Note that "#COMMENT", "#PCDATA" or "#CDATA" are valid tag names,
1880 that will create text elements.
1881
1882 To create an element "foo" containing a CDATA section:
1883
1884 my $foo= XML::Twig::Elt->new( '#CDATA' => "content of the CDATA section")
1885 ->wrap_in( 'foo');
1886
1887 An attribute of '#CDATA', will create the content of the element as
1888 CDATA:
1889
1890 my $elt= XML::Twig::Elt->new( 'p' => { '#CDATA' => 1}, 'foo < bar');
1891
1892 creates an element
1893
    <p><![CDATA[foo < bar]]></p>
1895
1896 parse ($string, %args)
1897 Creates an element from an XML string. The string is actually
1898 parsed as a new twig, then the root of that twig is returned. The
1899 arguments in %args are passed to the twig. As always if the parse
1900 fails the parser will die, so use an eval if you want to trap
1901 syntax errors.
1902
1903 As obviously the element does not exist beforehand this method has
1904 to be called on the class:
1905
1906 my $elt= parse XML::Twig::Elt( "<a> string to parse, with <sub/>
1907 <elements>, actually tons of </elements>
1908 h</a>");
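
The advice about trapping errors can be sketched like this, using the direct method-call form. Both strings are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# a well-formed string gives back the root element of the parsed fragment
my $elt = XML::Twig::Elt->parse( '<a>text with <b>markup</b></a>');
print $elt->tag, ": ", $elt->text, "\n";

# a malformed string makes the parser die, so trap the error with eval
my $bad   = eval { XML::Twig::Elt->parse( '<a>unclosed') };
my $error = $@;
print "parse failed\n" unless defined $bad;
```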
1909
1910 set_inner_xml ($string)
1911 Sets the content of the element to be the tree created from the
1912 string
1913
1914 set_inner_html ($string)
1915 Sets the content of the element, after parsing the string with an
1916 HTML parser (HTML::Parser)
1917
1918 set_outer_xml ($string)
1919 Replaces the element with the tree created from the string
1920
1921 print ($optional_filehandle, $optional_pretty_print_style)
1922 Prints an entire element, including the tags, optionally to a
1923 $optional_filehandle, optionally with a $pretty_print_style.
1924
1925 The print outputs XML data so base entities are escaped.
1926
1927 print_to_file ($filename, %options)
1928 Prints the element to file $filename.
1929
options: see "flush".

sprint ($elt, $optional_no_enclosing_tag)
1933 Return the xml string for an entire element, including the tags.
1934 If the optional second argument is true then only the string inside
1935 the element is returned (the start and end tag for $elt are not).
1936 The text is XML-escaped: base entities (& and < in text, & < and "
1937 in attribute values) are turned into entities.
1938
1939 gi Return the gi of the element (the gi is the "generic identifier"
1940 the tag name in SGML parlance).
1941
1942 "tag" and "name" are synonyms of "gi".
1943
1944 tag Same as "gi"
1945
1946 name
1947 Same as "tag"
1948
1949 set_gi ($tag)
1950 Set the gi (tag) of an element
1951
1952 set_tag ($tag)
1953 Set the tag (="tag") of an element
1954
1955 set_name ($name)
1956 Set the name (="tag") of an element
1957
1958 root
1959 Return the root of the twig in which the element is contained.
1960
1961 twig
1962 Return the twig containing the element.
1963
1964 parent ($optional_condition)
1965 Return the parent of the element, or the first ancestor matching
1966 the $optional_condition
1967
1968 first_child ($optional_condition)
1969 Return the first child of the element, or the first child matching
1970 the $optional_condition
1971
1972 has_child ($optional_condition)
1973 Return the first child of the element, or the first child matching
1974 the $optional_condition (same as first_child)
1975
1976 has_children ($optional_condition)
1977 Return the first child of the element, or the first child matching
1978 the $optional_condition (same as first_child)
1979
1980 first_child_text ($optional_condition)
Return the text of the first child of the element, or of the first
child matching the $optional_condition. If there is no first_child then
returns ''. This avoids getting the child, checking for its
existence then getting the text for trivial cases.
1986
1987 Similar methods are available for the other navigation methods:
1988
1989 last_child_text
1990 prev_sibling_text
1991 next_sibling_text
1992 prev_elt_text
1993 next_elt_text
1994 child_text
1995 parent_text
1996
All these methods also exist in a "trimmed" variant:
1998
1999 first_child_trimmed_text
2000 last_child_trimmed_text
2001 prev_sibling_trimmed_text
2002 next_sibling_trimmed_text
2003 prev_elt_trimmed_text
2004 next_elt_trimmed_text
2005 child_trimmed_text
2006 parent_trimmed_text
2007 field ($condition)
2008 Same method as "first_child_text" with a different name
2009
2010 fields ($condition_list)
Return the list of fields (text of the first child matching each
condition); missing fields are returned as the empty string.
2013
2014 Same method as "first_child_text" with a different name
2015
2016 trimmed_field ($optional_condition)
2017 Same method as "first_child_trimmed_text" with a different name
2018
2019 set_field ($condition, $optional_atts, @list_of_elt_and_strings)
2020 Set the content of the first child of the element that matches
2021 $condition, the rest of the arguments is the same as for
2022 "set_content"
2023
2024 If no child matches $condition _and_ if $condition is a valid XML
2025 element name, then a new element by that name is created and
2026 inserted as the last child.
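
The field methods can be tried out with a sketch like the following. The book record and its tag names are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $book = XML::Twig::Elt->parse(
    '<book><title>Perl</title><price>30</price></book>');

my $title  = $book->field( 'title');            # text of the first title child
my @values = $book->fields( 'title', 'isbn');   # the missing isbn gives ''

$book->set_field( price => '25');     # replaces the content of price
$book->set_field( isbn  => '0000');   # no isbn child yet, so one is appended

print "$title costs ", $book->field( 'price'), "\n";
print "last child is ", $book->last_child->tag, "\n";
```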
2027
2028 first_child_matches ($optional_condition)
Return the element if the first child of the element (if it exists)
passes the $optional_condition, "undef" otherwise

    if( $elt->first_child_matches( 'title')) ...

is equivalent to

    if( $elt->first_child && $elt->first_child->passes( 'title'))

"first_child_is" is another name for this method
2039
2040 Similar methods are available for the other navigation methods:
2041
2042 last_child_matches
2043 prev_sibling_matches
2044 next_sibling_matches
2045 prev_elt_matches
2046 next_elt_matches
2047 child_matches
2048 parent_matches
2049 is_first_child ($optional_condition)
2050 returns true (the element) if the element is the first child of its
2051 parent (optionally that satisfies the $optional_condition)
2052
2053 is_last_child ($optional_condition)
2054 returns true (the element) if the element is the last child of its
2055 parent (optionally that satisfies the $optional_condition)
2056
2057 prev_sibling ($optional_condition)
2058 Return the previous sibling of the element, or the previous sibling
2059 matching $optional_condition
2060
2061 next_sibling ($optional_condition)
2062 Return the next sibling of the element, or the first one matching
2063 $optional_condition.
2064
2065 next_elt ($optional_elt, $optional_condition)
2066 Return the next elt (optionally matching $optional_condition) of
2067 the element. This is defined as the next element which opens after
2068 the current element opens. Which usually means the first child of
2069 the element. Counter-intuitive as it might look this allows you to
2070 loop through the whole document by starting from the root.
2071
2072 The $optional_elt is the root of a subtree. When the "next_elt" is
2073 out of the subtree then the method returns undef. You can then walk
2074 a sub-tree with:
2075
2076 my $elt= $subtree_root;
2077 while( $elt= $elt->next_elt( $subtree_root))
2078 { # insert processing code here
2079 }
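
A complete version of the loop above, as a sketch. The sample document is made up, and text elements are skipped so that only tags are collected.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;
$twig->parse( '<doc><sec><title>one</title><p>text</p></sec>'
            . '<sec><title>two</title></sec></doc>');

my $subtree_root = $twig->root->first_child( 'sec');

# collect the tags met while walking the first section only
my @tags;
my $elt = $subtree_root;
while( $elt = $elt->next_elt( $subtree_root))
  { next if $elt->is_text;        # skip the #PCDATA elements
    push @tags, $elt->tag;
  }

print join( ' ', @tags), "\n";
```

The second "sec" element is never visited, because the walk returns undef as soon as it leaves the subtree.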
2080
2081 prev_elt ($optional_condition)
2082 Return the previous elt (optionally matching $optional_condition)
2083 of the element. This is the first element which opens before the
2084 current one. It is usually either the last descendant of the
2085 previous sibling or simply the parent
2086
2087 next_n_elt ($offset, $optional_condition)
2088 Return the $offset-th element that matches the $optional_condition
2089
2090 following_elt
2091 Return the following element (as per the XPath following axis)
2092
2093 preceding_elt
2094 Return the preceding element (as per the XPath preceding axis)
2095
2096 following_elts
2097 Return the list of following elements (as per the XPath following
2098 axis)
2099
2100 preceding_elts
Return the list of preceding elements (as per the XPath preceding
axis)
2103
2104 children ($optional_condition)
2105 Return the list of children (optionally which matches
2106 $optional_condition) of the element. The list is in document order.
2107
2108 children_count ($optional_condition)
2109 Return the number of children of the element (optionally which
2110 matches $optional_condition)
2111
2112 children_text ($optional_condition)
In array context, returns an array containing the text of children
2114 of the element (optionally which matches $optional_condition)
2115
2116 In scalar context, returns the concatenation of the text of
2117 children of the element
2118
2119 children_trimmed_text ($optional_condition)
2120 In array context, returns an array containing the trimmed text of
2121 children of the element (optionally which matches
2122 $optional_condition)
2123
2124 In scalar context, returns the concatenation of the trimmed text of
2125 children of the element
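
The difference between the contexts and between the plain and trimmed variants can be seen in a small sketch. The list document is invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $list = XML::Twig::Elt->parse(
    '<list><item> first </item><item>second</item><note>n</note></list>');

my $count   = $list->children_count( 'item');
my @texts   = $list->children_text( 'item');            # list context
my $joined  = $list->children_text( 'item');            # scalar context
my @trimmed = $list->children_trimmed_text( 'item');

print "$count items, joined as '$joined'\n";
print "trimmed: ", join( '|', @trimmed), "\n";
```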
2126
2127 children_copy ($optional_condition)
2128 Return a list of elements that are copies of the children of the
2129 element, optionally which matches $optional_condition
2130
2131 descendants ($optional_condition)
2132 Return the list of all descendants (optionally which matches
2133 $optional_condition) of the element. This is the equivalent of the
2134 "getElementsByTagName" of the DOM (by the way, if you are really a
2135 DOM addict, you can use "getElementsByTagName" instead)
2136
2137 getElementsByTagName ($optional_condition)
2138 Same as "descendants"
2139
2140 find_by_tag_name ($optional_condition)
2141 Same as "descendants"
2142
2143 descendants_or_self ($optional_condition)
2144 Same as "descendants" except that the element itself is included in
2145 the list if it matches the $optional_condition
2146
2147 first_descendant ($optional_condition)
2148 Return the first descendant of the element that matches the
2149 condition
2150
2151 last_descendant ($optional_condition)
2152 Return the last descendant of the element that matches the
2153 condition
2154
2155 ancestors ($optional_condition)
2156 Return the list of ancestors (optionally matching
2157 $optional_condition) of the element. The list is ordered from the
2158 innermost ancestor to the outermost one
2159
2160 NOTE: the element itself is not part of the list, in order to
2161 include it you will have to use ancestors_or_self
2162
2163 ancestors_or_self ($optional_condition)
2164 Return the list of ancestors (optionally matching
2165 $optional_condition) of the element, including the element (if it
matches the condition). The list is ordered from the innermost
2167 ancestor to the outermost one
2168
2169 passes ($condition)
2170 Return the element if it passes the $condition
2171
2172 att ($att)
2173 Return the value of attribute $att or "undef"
2174
2175 latt ($att)
2176 Return the value of attribute $att or "undef"
2177
2178 this method is an lvalue, so you can do "$elt->latt( 'foo')= 'bar'"
2179 or "$elt->latt( 'foo')++;"
2180
2181 set_att ($att, $att_value)
2182 Set the attribute of the element to the given value
2183
2184 You can actually set several attributes this way:
2185
2186 $elt->set_att( att1 => "val1", att2 => "val2");
2187
2188 del_att ($att)
2189 Delete the attribute for the element
2190
2191 You can actually delete several attributes at once:
2192
2193 $elt->del_att( 'att1', 'att2', 'att3');
2194
2195 att_exists ($att)
2196 Returns true if the attribute $att exists for the element, false
2197 otherwise
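
The attribute methods above fit together as in this sketch. The element and its attribute names are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $item = XML::Twig::Elt->new( item => { id => 'i1', qty => 2 });

$item->set_att( color => 'red', size => 'L');   # several attributes at once
$item->latt( 'qty')++;                          # latt is an lvalue
$item->del_att( 'size');

my $qty      = $item->att( 'qty');
my $has_size = $item->att_exists( 'size') ? 'yes' : 'no';

print "qty=$qty, size present: $has_size\n";
```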
2198
2199 cut Cut the element from the tree. The element still exists, it can be
2200 copied or pasted somewhere else, it is just not attached to the
2201 tree anymore.
2202
2203 Note that the "old" links to the parent, previous and next siblings
2204 can still be accessed using the former_* methods
2205
2206 former_next_sibling
2207 Returns the former next sibling of a cut node (or undef if the node
2208 has not been cut)
2209
2210 This makes it easier to write loops where you cut elements:
2211
    my $child= $parent->first_child( 'achild');
    while( $child && $child->att( 'cut'))
      { $child->cut; $child= $child->former_next_sibling; }
2215
2216 former_prev_sibling
2217 Returns the former previous sibling of a cut node (or undef if the
2218 node has not been cut)
2219
2220 former_parent
2221 Returns the former parent of a cut node (or undef if the node has
2222 not been cut)
2223
2224 cut_children ($optional_condition)
2225 Cut all the children of the element (or all of those which satisfy
2226 the $optional_condition).
2227
2228 Return the list of children
2229
2230 cut_descendants ($optional_condition)
2231 Cut all the descendants of the element (or all of those which
2232 satisfy the $optional_condition).
2233
2234 Return the list of descendants
2235
2236 copy ($elt)
2237 Return a copy of the element. The copy is a "deep" copy: all sub-
2238 elements of the element are duplicated.
2239
2240 paste ($optional_position, $ref)
2241 Paste a (previously "cut" or newly generated) element. Die if the
2242 element already belongs to a tree.
2243
2244 Note that the calling element is pasted:
2245
2246 $child->paste( first_child => $existing_parent);
2247 $new_sibling->paste( after => $this_sibling_is_already_in_the_tree);
2248
2249 or
2250
2251 my $new_elt= XML::Twig::Elt->new( tag => $content);
2252 $new_elt->paste( $position => $existing_elt);
2253
2254 Example:
2255
    my $t= XML::Twig->new->parsefile( 'doc.xml');
    my $toc= $t->root->new( 'toc');
    $toc->paste( $t->root); # $toc is pasted as first child of the root
    foreach my $title ($t->findnodes( '/doc/section/title'))
      { my $title_toc= $title->copy;
        # paste $title_toc as the last child of toc
        $title_toc->paste( last_child => $toc);
      }
2264
2265 Position options:
2266
2267 first_child (default)
2268 The element is pasted as the first child of $ref
2269
2270 last_child
2271 The element is pasted as the last child of $ref
2272
2273 before
2274 The element is pasted before $ref, as its previous sibling.
2275
2276 after
2277 The element is pasted after $ref, as its next sibling.
2278
2279 within
2280 In this case an extra argument, $offset, should be supplied.
2281 The element will be pasted in the reference element (or in its
2282 first text child) at the given offset. To achieve this the
2283 reference element will be split at the offset.
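
The first four positions can be compared side by side in a small sketch. The tag names are invented and the elements are created on the fly.

```perl
use strict;
use warnings;
use XML::Twig;

my $doc = XML::Twig::Elt->parse( '<doc><ref/></doc>');
my $ref = $doc->first_child( 'ref');

# the calling element is the one that gets pasted
XML::Twig::Elt->new( 'a')->paste( before => $ref);
XML::Twig::Elt->new( 'c')->paste( after  => $ref);
XML::Twig::Elt->new( 'start')->paste( first_child => $doc);
XML::Twig::Elt->new( 'end')->paste( last_child => $doc);

my @tags = map { $_->tag } $doc->children;
print join( ' ', @tags), "\n";
```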
2284
2285 Note that you can call directly the underlying method:
2286
2287 paste_before
2288 paste_after
2289 paste_first_child
2290 paste_last_child
2291 paste_within
2292 move ($optional_position, $ref)
2293 Move an element in the tree. This is just a "cut" then a "paste".
2294 The syntax is the same as "paste".
2295
2296 replace ($ref)
2297 Replaces an element in the tree. Sometimes it is just not possible
to "cut" an element then "paste" another in its place, so "replace"
2299 comes in handy. The calling element replaces $ref.
2300
2301 replace_with (@elts)
2302 Replaces the calling element with one or more elements
2303
2304 delete
Cuts the element and frees its memory.
2306
2307 prefix ($text, $optional_option)
Add a prefix to an element. If the element is a "PCDATA" element
the text is added to the pcdata, if the element's first child is a
"PCDATA" then the text is added to its pcdata, otherwise a new
"PCDATA" element is created and pasted as the first child of the
element.
2313
2314 If the option is "asis" then the prefix is added asis: it is
2315 created in a separate "PCDATA" element with an "asis" property. You
2316 can then write:
2317
2318 $elt1->prefix( '<b>', 'asis');
2319
2320 to create a "<b>" in the output of "print".
2321
2322 suffix ($text, $optional_option)
Add a suffix to an element. If the element is a "PCDATA" element
the text is added to the pcdata, if the element's last child is a
"PCDATA" then the text is added to its pcdata, otherwise a new
"PCDATA" element is created and pasted as the last child of the
element.
2328
2329 If the option is "asis" then the suffix is added asis: it is
2330 created in a separate "PCDATA" element with an "asis" property. You
2331 can then write:
2332
2333 $elt2->suffix( '</b>', 'asis');
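
The effect of the "asis" option is easiest to see by comparing the two outputs. This is a sketch on a made-up paragraph, and it assumes the default pretty print style.

```perl
use strict;
use warnings;
use XML::Twig;

# regular prefix: the text is escaped on output
my $plain = XML::Twig::Elt->parse( '<p>text</p>');
$plain->prefix( '<b>');
my $escaped = $plain->sprint;

# asis prefix and suffix: the text is output verbatim
my $raw = XML::Twig::Elt->parse( '<p>text</p>');
$raw->prefix( '<b>',  'asis');
$raw->suffix( '</b>', 'asis');
my $verbatim = $raw->sprint;

print "$escaped\n$verbatim\n";
```

Note that nothing checks that an "asis" prefix and suffix are balanced, so keeping the output well-formed is up to the caller.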
2334
2335 trim
2336 Trim the element in-place: spaces at the beginning and at the end
2337 of the element are discarded and multiple spaces within the element
2338 (or its descendants) are replaced by a single space.
2339
2340 Note that in some cases you can still end up with multiple spaces,
2341 if they are split between several elements:
2342
2343 <doc> text <b> hah! </b> yep</doc>
2344
2345 gets trimmed to
2346
2347 <doc>text <b> hah! </b> yep</doc>
2348
2349 This is somewhere in between a bug and a feature.
2350
2351 normalize
Merge together all consecutive pcdata elements in the element (if
for example you have turned some elements into pcdata using
"erase", this will give you a "clean" element in which all
text fragments are as long as possible).
2356
2357 simplify (%options)
2358 Return a data structure suspiciously similar to XML::Simple's.
2359 Options are identical to XMLin options, see XML::Simple doc for
more details (or use Data::Dumper or YAML to dump the data
2361 structure)
2362
2363 Note: there is no magic here, if you write "$twig->parsefile( $file
2364 )->simplify();" then it will load the entire document in memory. I
2365 am afraid you will have to put some work into it to get just the
bits you want and discard the rest. Look at the synopsis or the
2367 XML::Twig 101 section at the top of the docs for more information.
2368
2369 content_key
2370 forcearray
2371 keyattr
2372 noattr
2373 normalize_space
2374 aka normalise_space
2375
2376 variables (%var_hash)
2377 %var_hash is a hash { name => value }
2378
2379 This option allows variables in the XML to be expanded when the
2380 file is read. (there is no facility for putting the variable
2381 names back if you regenerate XML using XMLout).
2382
2383 A 'variable' is any text of the form ${name} (or $name) which
2384 occurs in an attribute value or in the text content of an
2385 element. If 'name' matches a key in the supplied hashref,
2386 ${name} will be replaced with the corresponding value from the
2387 hashref. If no matching key is found, the variable will not be
2388 replaced.
2389
2390 var_att ($attribute_name)
2391 This option gives the name of an attribute that will be used to
2392 create variables in the XML:
2393
2394 <dirs>
2395 <dir name="prefix">/usr/local</dir>
2396 <dir name="exec_prefix">$prefix/bin</dir>
2397 </dirs>
2398
2399 use "var => 'name'" to get $prefix replaced by /usr/local in
2400 the generated data structure
2401
2402 By default variables are captured by the following regexp:
/\$(\w+)/
2404
2405 var_regexp (regexp)
2406 This option changes the regexp used to capture variables. The
2407 variable name should be in $1
2408
2409 group_tags { grouping tag => grouped tag, grouping tag 2 => grouped
2410 tag 2...}
Option used to simplify the structure: the grouping elements listed
are left out of the result, and their children are treated as
children of the grouping element's parent.
2414
2415 If the element is:
2416
2417 <config host="laptop.xmltwig.org">
2418 <server>localhost</server>
2419 <dirs>
2420 <dir name="base">/home/mrodrigu/standards</dir>
2421 <dir name="tools">$base/tools</dir>
2422 </dirs>
2423 <templates>
2424 <template name="std_def">std_def.templ</template>
2425 <template name="dummy">dummy</template>
2426 </templates>
2427 </config>
2428
2429 Then calling simplify with "group_tags => { dirs => 'dir',
2430 templates => 'template'}" makes the data structure be exactly
2431 as if the start and end tags for "dirs" and "templates" were
2432 not there.
2433
2434 A YAML dump of the structure
2435
2436 base: '/home/mrodrigu/standards'
2437 host: laptop.xmltwig.org
2438 server: localhost
2439 template:
2440 - std_def.templ
2441 - dummy.templ
2442 tools: '$base/tools'
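
Leaving the options aside, the basic call can be sketched as follows. The configuration document is a cut-down, made-up variant of the one above, and the exact shape of the result for more complex documents should be checked with Data::Dumper rather than assumed.

```perl
use strict;
use warnings;
use XML::Twig;
use Data::Dumper;

my $twig = XML::Twig->new;
$twig->parse( '<config host="laptop.xmltwig.org">'
            . '<server>localhost</server><port>8080</port></config>');

# attributes and text-only children end up as keys of one hash
my $data = $twig->root->simplify;

print Dumper( $data);
print "server is $data->{server} on $data->{host}\n";
```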
2443
2444 split_at ($offset)
2445 Split a text ("PCDATA" or "CDATA") element in 2 at $offset, the
2446 original element now holds the first part of the string and a new
2447 element holds the right part. The new element is returned
2448
2449 If the element is not a text element then the first text child of
2450 the element is split
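
A minimal sketch of splitting a text element. The paragraph is made up, and the method is called directly on the text child.

```perl
use strict;
use warnings;
use XML::Twig;

my $p    = XML::Twig::Elt->parse( '<p>hello world</p>');
my $text = $p->first_child;              # the #PCDATA element

my $right = $text->split_at( 5);         # new element holding the rest

my $left_string  = $text->text;
my $right_string = $right->text;
my $nb_children  = $p->children_count;

print "'$left_string' + '$right_string' in $nb_children children\n";
```

The serialized element is unchanged by the split; calling "normalize" on the parent would merge the two text elements again.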
2451
2452 split ( $optional_regexp, $tag1, $atts1, $tag2, $atts2...)
2453 Split the text descendants of an element in place, the text is
2454 split using the $regexp, if the regexp includes () then the matched
2455 separators will be wrapped in elements. $1 is wrapped in $tag1,
2456 with attributes $atts1 if $atts1 is given (as a hashref), $2 is
2457 wrapped in $tag2...
2458
2459 if $elt is "<p>tati tata <b>tutu tati titi</b> tata tati tata</p>"
2460
2461 $elt->split( qr/(ta)ti/, 'foo', {type => 'toto'} )
2462
2463 will change $elt to
2464
2465 <p><foo type="toto">ta</foo> tata <b>tutu <foo type="toto">ta</foo>
2466 titi</b> tata <foo type="toto">ta</foo> tata</p>
2467
2468 The regexp can be passed either as a string or as "qr//" (perl
2469 5.005 and later), it defaults to \s+ just as the "split" built-in
2470 (but this would be quite a useless behaviour without the
2471 $optional_tag parameter)
2472
2473 $optional_tag defaults to PCDATA or CDATA, depending on the initial
2474 element type
2475
2476 The list of descendants is returned (including un-touched original
2477 elements and newly created ones)
2478
2479 mark ( $regexp, $optional_tag, $optional_attribute_ref)
2480 This method behaves exactly as split, except only the newly created
2481 elements are returned
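
As a sketch, here is "mark" wrapping every number in a paragraph. The text and the "num" tag are assumptions made for the example; the regexp captures the whole match so that no text is lost.

```perl
use strict;
use warnings;
use XML::Twig;

my $p = XML::Twig::Elt->parse( '<p>call 555 or 777 today</p>');

# $1 is wrapped in a num element, only the new elements are returned
my @nums = $p->mark( qr/(\d+)/, 'num');
my $out  = $p->sprint;

print scalar( @nums), " numbers marked\n";
print "$out\n";
```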
2482
2483 wrap_children ( $regexp_string, $tag, $optional_attribute_hashref)
2484 Wrap the children of the element that match the regexp in an
2485 element $tag. If $optional_attribute_hashref is passed then the
2486 new element will have these attributes.
2487
The $regexp_string includes tags, within pointy brackets, as in
"<title><para>+" and the usual Perl modifiers (+*?...). Tags can
be further qualified with attributes: "<para type="warning"
classif="cosmic_secret">+". The values for attributes should be
xml-escaped: "<candy type="M&amp;Ms">*" ("<", "&", ">" and """
should be escaped).
2494
2495 Note that elements might get extra "id" attributes in the process.
2496 See add_id. Use strip_att to remove unwanted id's.
2497
2498 Here is an example:
2499
2500 If the element $elt has the following content:
2501
2502 <elt>
2503 <p>para 1</p>
2504 <l_l1_1>list 1 item 1 para 1</l_l1_1>
2505 <l_l1>list 1 item 1 para 2</l_l1>
2506 <l_l1_n>list 1 item 2 para 1 (only para)</l_l1_n>
2507 <l_l1_n>list 1 item 3 para 1</l_l1_n>
2508 <l_l1>list 1 item 3 para 2</l_l1>
2509 <l_l1>list 1 item 3 para 3</l_l1>
2510 <l_l1_1>list 2 item 1 para 1</l_l1_1>
2511 <l_l1>list 2 item 1 para 2</l_l1>
2512 <l_l1_n>list 2 item 2 para 1 (only para)</l_l1_n>
2513 <l_l1_n>list 2 item 3 para 1</l_l1_n>
2514 <l_l1>list 2 item 3 para 2</l_l1>
2515 <l_l1>list 2 item 3 para 3</l_l1>
2516 </elt>
2517
2518 Then the code
2519
2520 $elt->wrap_children( q{<l_l1_1><l_l1>*} , li => { type => "ul1" });
2521 $elt->wrap_children( q{<l_l1_n><l_l1>*} , li => { type => "ul" });
2522
2523 $elt->wrap_children( q{<li type="ul1"><li type="ul">+}, "ul");
2524 $elt->strip_att( 'id');
2525 $elt->strip_att( 'type');
2526 $elt->print;
2527
2528 will output:
2529
2530 <elt>
2531 <p>para 1</p>
2532 <ul>
2533 <li>
2534 <l_l1_1>list 1 item 1 para 1</l_l1_1>
2535 <l_l1>list 1 item 1 para 2</l_l1>
2536 </li>
2537 <li>
2538 <l_l1_n>list 1 item 2 para 1 (only para)</l_l1_n>
2539 </li>
2540 <li>
2541 <l_l1_n>list 1 item 3 para 1</l_l1_n>
2542 <l_l1>list 1 item 3 para 2</l_l1>
2543 <l_l1>list 1 item 3 para 3</l_l1>
2544 </li>
2545 </ul>
2546 <ul>
2547 <li>
2548 <l_l1_1>list 2 item 1 para 1</l_l1_1>
2549 <l_l1>list 2 item 1 para 2</l_l1>
2550 </li>
2551 <li>
2552 <l_l1_n>list 2 item 2 para 1 (only para)</l_l1_n>
2553 </li>
2554 <li>
2555 <l_l1_n>list 2 item 3 para 1</l_l1_n>
2556 <l_l1>list 2 item 3 para 2</l_l1>
2557 <l_l1>list 2 item 3 para 3</l_l1>
2558 </li>
2559 </ul>
2560 </elt>
2561
2562 subs_text ($regexp, $replace)
2563 subs_text does text substitution, similar to perl's "s///"
2564 operator.
2565
2566 $regexp must be a perl regexp, created with the "qr" operator.
2567
2568 $replace can include "$1, $2"... from the $regexp. It can also be
2569 used to create elements and entities, by using "&elt( tag => { att
2570 => val }, text)" (similar syntax as "new") and "&ent( name)".
2571
2572 Here is a rather complex example:
2573
2574 $elt->subs_text( qr{(?<!do not )link to (http://([^\s,]*))},
2575 'see &elt( a =>{ href => $1 }, $2)'
2576 );
2577
2578 This will replace text like link to http://www.xmltwig.org by see
2579 <a href="http://www.xmltwig.org">www.xmltwig.org</a>, but not do not
2580 link to...
2581
2582 Generating entities (here replacing spaces with &nbsp;):
2583
2584 $elt->subs_text( qr{ }, '&ent( "&nbsp;")');
2585
2586 or, using a variable:
2587
2588 my $ent="&nbsp;";
2589 $elt->subs_text( qr{ }, "&ent( '$ent')");
2590
2591 Note that the substitution is always global, as in using the "g"
2592 modifier in a perl substitution, and that it is performed on all
2593 text descendants of the element.
2594
2595 Bug: in the $regexp, you can only use "\1", "\2"... if the
2596 replacement expression does not include elements or attributes. eg
2597
2598 $t->subs_text( qr/((t[aiou])\2)/, '$2'); # ok, replaces toto, tata, titi, tutu by to, ta, ti, tu
2599 $t->subs_text( qr/((t[aiou])\2)/, '&elt(p => $1)' ); # NOK, does not find toto...
2600
2601 add_id ($optional_coderef)
2602 Add an id to the element.
2603
2604 The id is an attribute, "id" by default, see the "id" option for
2605 XML::Twig "new" to change it. Use an id starting with "#" to get an
2606 id that's not output by print, flush or sprint, yet that allows you
2607 to use the elt_id method to get the element easily.
2608
2609 If the element already has an id, no new id is generated.
2610
2611 By default the method creates an id of the form "twig_id_<nnnn>",
2612 where "<nnnn>" is a number, incremented each time the method is
2613 called successfully.
2614
2615 set_id_seed ($prefix)
2616 By default the id generated by "add_id" is "twig_id_<nnnn>";
2617 "set_id_seed" changes the prefix to $prefix and resets the number
2618 to 1.
2619
2620 strip_att ($att)
2621 Remove the attribute $att from all descendants of the element
2622 (including the element)
2623
2624 Return the element
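The id methods above can be combined as in the following sketch. The document and variable names are made up for illustration, and the exact form of the generated id is deliberately not relied upon.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative document: one paragraph without an id, one with an id.
my $twig = XML::Twig->new->parse('<doc><p>one</p><p id="p2">two</p></doc>');
my ($p1, $p2) = $twig->root->children('p');

$p1->add_id;                 # no id yet: one is generated
$p2->add_id;                 # already has an id: left unchanged

my $generated = $p1->id;     # whatever id add_id produced
my $kept      = $p2->id;     # still "p2"

# The generated id can be used to find the element again.
my $found = $twig->elt_id($generated);

# Remove every id attribute from the root and all its descendants.
$twig->root->strip_att('id');
my $out = $twig->root->sprint;
print "$out\n";
```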
2625
2626 change_att_name ($old_name, $new_name)
2627 Change the name of the attribute from $old_name to $new_name. If
2628 there is no attribute $old_name nothing happens.
2629
2630 lc_attnames
2631 Lowercase the names of all the attributes of the element.
2632
2633 sort_children_on_value( %options)
2634 Sort the children of the element in place according to their text.
2635 All children are sorted.
2636
2639 %options are
2640
2641 type : numeric | alpha (default: alpha)
2642 order : normal | reverse (default: normal)
2643
2644 Return the element, with its children sorted
2645
2646 sort_children_on_att ($att, %options)
2647 Sort the children of the element in place according to attribute
2648 $att. %options are the same as for "sort_children_on_value"
2649
2650 Return the element.
2651
2652 sort_children_on_field ($tag, %options)
2653 Sort the children of the element in place, according to the field
2654 $tag (the text of the first child of the child with this tag).
2655 %options are the same as for "sort_children_on_value".
2656
2657 Return the element, with its children sorted
2658
2659 sort_children( $get_key, %options)
2660 Sort the children of the element in place. The $get_key argument is
2661 a reference to a function that returns the sort key when passed an
2662 element.
2663
2664 For example:
2665
2666 $elt->sort_children( sub { $_[0]->{'att'}->{"nb"} + $_[0]->text },
2667 type => 'numeric', order => 'reverse'
2668 );
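As a small sketch of the sorting methods above, the following sorts the same children first numerically on an attribute, then in reverse alphabetical order on their text. The sample data and variable names are invented for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Illustrative data: the "n" attribute and the text sort differently.
my $twig = XML::Twig->new->parse(
      '<list><item n="10">pear</item><item n="9">apple</item>'
    . '<item n="100">fig</item></list>');
my $list = $twig->root;

# Numeric sort on the attribute: 9, 10, 100.
$list->sort_children_on_att( 'n', type => 'numeric' );
my @by_att = map { $_->text } $list->children;

# Reverse alphabetical sort on the text of each child.
$list->sort_children_on_value( order => 'reverse' );
my @by_text = map { $_->text } $list->children;

print "by att:  @by_att\n";
print "by text: @by_text\n";
```

Without type => 'numeric' the attribute values would be compared as strings, so "10" and "100" would sort before "9".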
2669
2670 field_to_att ($cond, $att)
2671 Turn the text of the first sub-element matched by $cond into the
2672 value of attribute $att of the element. If $att is omitted then
2673 $cond is used as the name of the attribute, which makes sense only
2674 if $cond is a valid element (and attribute) name.
2675
2676 The sub-element is then cut.
2677
2678 att_to_field ($att, $tag)
2679 Take the value of attribute $att and create a sub-element $tag as
2680 first child of the element. If $tag is omitted then $att is used as
2681 the name of the sub-element.
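The two methods are inverses of each other, as this minimal sketch shows. The element and attribute names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig   = XML::Twig->new->parse('<person><name>Ann</name><age>42</age></person>');
my $person = $twig->root;

# Move the text of the <name> child into a "name" attribute.
$person->field_to_att('name');
my $as_att = $person->sprint;

# Move the attribute back into a child, this time called <full_name>.
$person->att_to_field( 'name', 'full_name' );
my $as_field = $person->sprint;

print "$as_att\n$as_field\n";
```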
2682
2683 get_xpath ($xpath, $optional_offset)
2684 Return a list of elements satisfying the $xpath. $xpath is an
2685 XPATH-like expression.
2686
2687 A subset of the XPATH abbreviated syntax is covered:
2688
2689 tag
2690 tag[1] (or any other positive number)
2691 tag[last()]
2692 tag[@att] (the attribute exists for the element)
2693 tag[@att="val"]
2694 tag[@att=~ /regexp/]
2695 tag[att1="val1" and att2="val2"]
2696 tag[att1="val1" or att2="val2"]
2697 tag[string()="toto"] (returns tag elements whose text (as per the text method)
2698 is toto)
2699 tag[string()=~/regexp/] (returns tag elements whose text (as per the text
2700 method) matches regexp)
2701 expressions can start with / (search starts at the document root)
2702 expressions can start with . (search starts at the current element)
2703 // can be used to get all descendants instead of just direct children
2704 * matches any tag
2705
2706 So the following examples from the XPath recommendation
2707 <http://www.w3.org/TR/xpath.html#path-abbrev> work:
2708
2709 para selects the para element children of the context node
2710 * selects all element children of the context node
2711 para[1] selects the first para child of the context node
2712 para[last()] selects the last para child of the context node
2713 */para selects all para grandchildren of the context node
2714 /doc/chapter[5]/section[2] selects the second section of the fifth chapter
2715 of the doc
2716 chapter//para selects the para element descendants of the chapter element
2717 children of the context node
2718 //para selects all the para descendants of the document root and thus selects
2719 all para elements in the same document as the context node
2720 //olist/item selects all the item elements in the same document as the
2721 context node that have an olist parent
2722 .//para selects the para element descendants of the context node
2723 .. selects the parent of the context node
2724 para[@type="warning"] selects all para children of the context node that have
2725 a type attribute with value warning
2726 employee[@secretary and @assistant] selects all the employee children of the
2727 context node that have both a secretary attribute and an assistant
2728 attribute
2729
2730 The elements will be returned in the document order.
2731
2732 If $optional_offset is used then only one element will be returned,
2733 the one with the appropriate offset in the list, starting at 0
2734
2735 Quoting and interpolating variables can be a pain when the Perl
2736 syntax and the XPATH syntax collide, so use alternate quoting
2737 mechanisms like q or qq (I like q{} and qq{} myself).
2738
2739 Here are some more examples to get you started:
2740
2741 my $p1= "p1";
2742 my $p2= "p2";
2743 my @res= $t->get_xpath( qq{p[string( "$p1") or string( "$p2")]});
2744
2745 my $a= "a1";
2746 my @res= $t->get_xpath( qq{//*[@att="$a"]});
2747
2748 my $val= "a1";
2749 my $exp= qq{//p[ \@att='$val']}; # you need to use \@ or you will get a warning
2750 my @res= $t->get_xpath( $exp);
2751
2752 Note that the only supported regexp delimiter is / and that you
2753 must backslash all / in regexps AND in regular strings.
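For a complete, runnable variant of the examples above, here is a minimal sketch. The document and variable names are invented for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse(
      '<doc><para type="warning">w1</para><para>p2</para>'
    . '<sec><para type="warning">w3</para></sec></doc>');
my $root = $twig->root;

# // searches all descendants; \@ avoids interpolation inside qq{}.
my $type     = 'warning';
my @warnings = $root->get_xpath( qq{//para[\@type="$type"]} );

# A bare tag name only matches direct children of the context element.
my @children = $root->get_xpath('para');

# With an offset a single element is returned (offsets start at 0).
my $second = $root->get_xpath( 'para', 1 );

print join( ', ', map { $_->text } @warnings ), "\n";
print $second->text, "\n";
```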
2754
2755 XML::Twig does not provide natively full XPATH support, but you can
2756 use "XML::Twig::XPath" to get "findnodes" to use "XML::XPath" as
2757 the XPath engine, with full coverage of the spec.
2758
2762 find_nodes
2763 same as "get_xpath"
2764
2765 findnodes
2766 same as "get_xpath"
2767
2768 text @optional_options
2769 Return a string consisting of all the "PCDATA" and "CDATA" in an
2770 element, without any tags. The text is not XML-escaped: base
2771 entities such as "&" and "<" are not escaped.
2772
2773 The '"no_recurse"' option will only return the text of the element,
2774 not of any included sub-elements (same as "text_only").
2775
2776 text_only
2777 Same as "text" except that the text returned doesn't include the
2778 text of sub-elements.
2779
2780 trimmed_text
2781 Same as "text" except that the text is trimmed: leading and
2782 trailing spaces are discarded, consecutive spaces are collapsed
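The three text methods differ only in recursion and whitespace handling, as this small sketch shows. The sample paragraph is made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse('<p>  Hello <b>big</b>   world  </p>');
my $p    = $twig->root;

my $text      = $p->text;          # all text, sub-elements included
my $text_only = $p->text_only;     # text of $p itself, <b> skipped
my $trimmed   = $p->trimmed_text;  # like text, with spaces normalized

print "[$text]\n[$text_only]\n[$trimmed]\n";
```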
2783
2784 set_text ($string)
2785 Set the text for the element: if the element is a "PCDATA", just
2786 set its text, otherwise cut all the children of the element and
2787 create a single "PCDATA" child for it, which holds the text.
2788
2789 merge ($elt2)
2790 Move the content of $elt2 within the element
2791
2792 insert ($tag1, [$optional_atts1], $tag2, [$optional_atts2],...)
2793 For each tag in the list inserts an element $tag as the only child
2794 of the element. The element gets the optional attributes
2795 in "$optional_atts<n>". All children of the element are set as
2796 children of the new element. The upper level element is returned.
2797
2798 $p->insert( table => { border=> 1}, 'tr', 'td')
2799
2800 puts the content of $p in a table with a visible border, a single
2801 "tr" and a single "td" and returns the "table" element:
2802
2803 <p><table border="1"><tr><td>original content of p</td></tr></table></p>
2804
2805 wrap_in (@tag)
2806 Wrap elements in @tag as the successive ancestors of the element,
2807 returns the new element. "$elt->wrap_in( 'td', 'tr', 'table')"
2808 wraps the element as a single cell in a table for example.
2809
2810 Optionally each tag can be followed by a hashref of attributes,
2811 that will be set on the wrapping element:
2812
2813 $elt->wrap_in( p => { class => "advisory" }, div => { class => "intro", id => "div_intro" });
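The difference between insert (new elements go inside the element) and wrap_in (new elements go around it) can be seen in this minimal sketch. The document is made up for illustration, and the return values are deliberately not used.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse('<doc><p>text</p></doc>');
my $p    = $twig->root->first_child('p');

# insert: the new elements become the content of <p>.
$p->insert( table => { border => 1 }, 'tr', 'td' );
my $after_insert = $twig->root->sprint;

# wrap_in: the new element becomes the parent of <p>.
$p->wrap_in( div => { class => 'intro' } );
my $after_wrap = $twig->root->sprint;

print "$after_insert\n$after_wrap\n";
```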
2814
2815 insert_new_elt ($opt_position, $tag, $opt_atts_hashref, @opt_content)
2816 Combines a "new" and a "paste": creates a new element using $tag,
2817 $opt_atts_hashref and @opt_content which are arguments similar to
2818 those for "new", then pastes it, using $opt_position or
2819 'first_child', relative to $elt.
2820
2821 Return the newly created element
2822
2823 erase
2824 Erase the element: the element is deleted and all of its children
2825 are pasted in its place.
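As a short sketch of insert_new_elt and erase working together (the document and the "note" element are invented for illustration):

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse('<doc><p>a <b>bold</b> word</p></doc>');
my $p    = $twig->root->first_child('p');

# Create <note type="tip">hi</note> and paste it as the last child of <p>.
my $note = $p->insert_new_elt( last_child => 'note', { type => 'tip' }, 'hi' );

# Remove the <b> tags but keep their content in place.
$p->first_child('b')->erase;

my $out = $twig->root->sprint;
print "$out\n";
```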
2826
2827 set_content ( $optional_atts, @list_of_elt_and_strings) (
2828 $optional_atts, '#EMPTY')
2829 Set the content for the element, from a list of strings and
2830 elements. Cuts all the element children, then pastes the list
2831 elements as the children. This method will create a "PCDATA"
2832 element for any strings in the list.
2833
2834 The $optional_atts argument is the ref of a hash of attributes. If
2835 this argument is used then the previous attributes are deleted,
2836 otherwise they are left untouched.
2837
2838 WARNING: if you rely on ID's then you will have to set the id
2839 yourself. At this point the element does not belong to a twig yet,
2840 so the ID attribute is not known so it won't be stored in the ID
2841 list.
2842
2843 A content of '"#EMPTY"' creates an empty element.
2844
2845 namespace ($optional_prefix)
2846 Return the URI of the namespace that $optional_prefix or the
2847 element name belongs to. If the name doesn't belong to any
2848 namespace, "undef" is returned.
2849
2850 local_name
2851 Return the local name (without the prefix) for the element
2852
2853 ns_prefix
2854 Return the namespace prefix for the element
2855
2856 current_ns_prefixes
2857 Return a list of namespace prefixes valid for the element. The
2858 order of the prefixes in the list has no meaning. If the default
2859 namespace is currently bound, '' appears in the list.
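The namespace accessors can be tried on a prefixed element as in this sketch. The document is made up for illustration; only the case of a declared prefix is shown.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse(
    '<doc xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></doc>');
my $rect = $twig->root->first_child;    # the only child, <svg:rect/>

my $prefix = $rect->ns_prefix;     # the part before the colon
my $local  = $rect->local_name;    # the part after the colon
my $uri    = $rect->namespace;     # URI bound to the element's prefix

print "$prefix | $local | $uri\n";
```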
2860
2861 inherit_att ($att, @optional_tag_list)
2862 Return the value of an attribute inherited from parent tags. The
2863 value returned is found by looking for the attribute in the element
2864 then in turn in each of its ancestors. If the @optional_tag_list is
2865 supplied only those ancestors whose tag is in the list will be
2866 checked.
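A minimal sketch of the lookup order, using an invented "lang" attribute set on two different ancestors:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse(
    '<doc lang="en"><sec lang="fr"><p>texte</p></sec><sec><p>text</p></sec></doc>');
my ($p1, $p2) = $twig->root->descendants('p');

my $nearest  = $p1->inherit_att('lang');          # found on the parent <sec>
my $from_doc = $p1->inherit_att( 'lang', 'doc' ); # only <doc> is checked
my $fallback = $p2->inherit_att('lang');          # <sec> has none, <doc> does

print "$nearest $from_doc $fallback\n";
```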
2867
2868 all_children_are ($optional_condition)
2869 Return 1 if all children of the element pass the
2870 $optional_condition, 0 otherwise
2871
2872 level ($optional_condition)
2873 Return the depth of the element in the twig (root is 0). If
2874 $optional_condition is given then only ancestors that match the
2875 condition are counted.
2876
2877 WARNING: in a tree created using the "twig_roots" option this will
2878 not return the level in the document tree, level 0 will be the
2879 document root, level 1 will be the "twig_roots" elements. During
2880 the parsing (in a "twig_handler") you can use the "depth" method on
2881 the twig object to get the real parsing depth.
2882
2883 in ($potential_parent)
2884 Return true if the element is in the potential_parent
2885 ($potential_parent is an element)
2886
2887 in_context ($cond, $optional_level)
2888 Return true if the element is included in an element which passes
2889 $cond optionally within $optional_level levels. The returned value
2890 is the including element.
2891
2892 pcdata
2893 Return the text of a "PCDATA" element or "undef" if the element is
2894 not "PCDATA".
2895
2896 pcdata_xml_string
2897 Return the text of a "PCDATA" element or undef if the element is
2898 not "PCDATA". The text is "XML-escaped" ('&' and '<' are replaced
2899 by '&amp;' and '&lt;')
2900
2901 set_pcdata ($text)
2902 Set the text of a "PCDATA" element. This method does not check that
2903 the element is indeed a "PCDATA" so usually you should use
2904 "set_text" instead.
2905
2906 append_pcdata ($text)
2907 Add the text at the end of a "PCDATA" element.
2908
2909 is_cdata
2910 Return 1 if the element is a "CDATA" element, returns 0 otherwise.
2911
2912 is_text
2913 Return 1 if the element is a "CDATA" or "PCDATA" element, returns 0
2914 otherwise.
2915
2916 cdata
2917 Return the text of a "CDATA" element or "undef" if the element is
2918 not "CDATA".
2919
2920 cdata_string
2921 Return the XML string of a "CDATA" element, including the opening
2922 and closing markers.
2923
2924 set_cdata ($text)
2925 Set the text of a "CDATA" element.
2926
2927 append_cdata ($text)
2928 Add the text at the end of a "CDATA" element.
2929
2930 remove_cdata
2931 Turns all "CDATA" sections in the element into regular "PCDATA"
2932 elements. This is useful when converting XML to HTML, as browsers
2933 do not support CDATA sections.
2934
2935 extra_data
2936 Return the extra_data (comments and PI's) attached to an element
2937
2938 set_extra_data ($extra_data)
2939 Set the extra_data (comments and PI's) attached to an element
2940
2941 append_extra_data ($extra_data)
2942 Append extra_data to the existing extra_data before the element (if
2943 no previous extra_data exists then it is created)
2944
2945 set_asis
2946 Set a property of the element that causes it to be output without
2947 being XML escaped by the print functions: if it contains "a < b" it
2948 will be output as such and not as "a &lt; b". This can be useful to
2949 create text elements that will be output as markup. Note that all
2950 "PCDATA" descendants of the element are also marked as having the
2951 property (they are the ones that are actually impacted by the
2952 change).
2953
2954 If the element is a "CDATA" element it will also be output asis,
2955 without the "CDATA" markers. The same goes for any "CDATA"
2956 descendant of the element
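The effect of the "asis" property on output can be seen in this small sketch. The element is created directly with "new" for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $elt = XML::Twig::Elt->new( p => 'a < b' );

my $escaped = $elt->sprint;    # normal output: the "<" is escaped

$elt->set_asis;
my $raw = $elt->sprint;        # asis output: the text is left untouched

print "$escaped\n$raw\n";
```

Note that the second form is not well-formed XML, so set_asis is best reserved for text that is known to contain valid markup.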
2957
2958 set_not_asis
2959 Unsets the "asis" property for the element and its text
2960 descendants.
2961
2962 is_asis
2963 Return the "asis" property status of the element ( 1 or "undef")
2964
2965 closed
2966 Return true if the element has been closed. Might be useful if you
2967 are somewhere in the tree, during the parse, and have no idea
2968 whether a parent element is completely loaded or not.
2969
2970 get_type
2971 Return the type of the element: '"#ELT"' for "real" elements, or
2972 '"#PCDATA"', '"#CDATA"', '"#COMMENT"', '"#ENT"', '"#PI"'
2973
2974 is_elt
2975 Return the tag if the element is a "real" element, or 0 if it is
2976 "PCDATA", "CDATA"...
2977
2978 contains_only_text
2979 Return 1 if the element does not contain any other "real" element
2980
2981 contains_only ($exp)
2982 Return the list of children if all children of the element match
2983 the expression $exp
2984
2985 if( $para->contains_only( 'tt')) { ... }
2986
2987 contains_a_single ($exp)
2988 If the element contains a single child that matches the expression
2989 $exp returns that element. Otherwise returns 0.
2990
2991 is_field
2992 same as "contains_only_text"
2993
2994 is_pcdata
2995 Return 1 if the element is a "PCDATA" element, returns 0 otherwise.
2996
2997 is_ent
2998 Return 1 if the element is an entity (an unexpanded entity)
2999 element, return 0 otherwise.
3000
3001 is_empty
3002 Return 1 if the element is empty, 0 otherwise
3003
3004 set_empty
3005 Flags the element as empty. No further check is made, so if the
3006 element is actually not empty the output will be malformed. The only
3007 effect of this method is that the output will be "<tag
3008 att="value"/>".
3009
3010 set_not_empty
3011 Flags the element as not empty. If it is actually empty then the
3012 element will be output as "<tag att="value"></tag>"
3013
3014 is_pi
3015 Return 1 if the element is a processing instruction ("#PI")
3016 element, return 0 otherwise.
3017
3018 target
3019 Return the target of a processing instruction
3020
3021 set_target ($target)
3022 Set the target of a processing instruction
3023
3024 data
3025 Return the data part of a processing instruction
3026
3027 set_data ($data)
3028 Set the data of a processing instruction
3029
3030 set_pi ($target, $data)
3031 Set the target and data of a processing instruction
3032
3033 pi_string
3034 Return the string form of a processing instruction ("<?target
3035 data?>")
3036
3037 is_comment
3038 Return 1 if the element is a comment ("#COMMENT") element, return 0
3039 otherwise.
3040
3041 set_comment ($comment_text)
3042 Set the text for a comment
3043
3044 comment
3045 Return the content of a comment (just the text, not the "<!--" and
3046 "-->")
3047
3048 comment_string
3049 Return the XML string for a comment ("<!-- comment -->")
3050
3051 Note that an XML comment cannot start or end with a '-', or include
3052 '--' (http://www.w3.org/TR/2008/REC-xml-20081126/#sec-comments). If
3053 that is the case (because you have created the comment yourself
3054 presumably, as it could not be in the input XML), then a space will
3055 be inserted before an initial '-', after a trailing one or between
3056 two '-' in the comment (which could presumably mangle javascript
3057 "hidden" in an XHTML comment).
3058
3059 set_ent ($entity)
3060 Set a (non-expanded) entity ("#ENT"). $entity is the entity text
3061 ("&ent;")
3062
3063 ent Return the entity for an entity ("#ENT") element ("&ent;")
3064
3065 ent_name
3066 Return the entity name for an entity ("#ENT") element ("ent")
3067
3068 ent_string
3069 Return the entity, either expanded if the expanded version is
3070 available, or non-expanded ("&ent;") otherwise
3071
3072 child ($offset, $optional_condition)
3073 Return the $offset-th child of the element, optionally the
3074 $offset-th child that matches $optional_condition. The children are
3075 treated as a list, so "$elt->child( 0)" is the first child, while
3076 "$elt->child( -1)" is the last child.
3077
3078 child_text ($offset, $optional_condition)
3079 Return the text of a child or "undef" if the child does not
3080 exist. Arguments are the same as "child".
3081
3082 last_child ($optional_condition)
3083 Return the last child of the element, or the last child matching
3084 $optional_condition (ie the last of the element children matching
3085 the condition).
3086
3087 last_child_text ($optional_condition)
3088 Same as "first_child_text" but for the last child.
3089
3090 sibling ($offset, $optional_condition)
3091 Return the next or previous $offset-th sibling of the element, or
3092 the $offset-th one matching $optional_condition. If $offset is
3093 negative then a previous sibling is returned, if $offset is
3094 positive then a next sibling is returned. "$offset=0" returns the
3095 element if there is no condition or if the element matches the
3096 condition, "undef" otherwise.
3097
3098 sibling_text ($offset, $optional_condition)
3099 Return the text of a sibling or "undef" if the sibling does not
3100 exist. Arguments are the same as "sibling".
3101
3102 prev_siblings ($optional_condition)
3103 Return the list of previous siblings (optionally matching
3104 $optional_condition) for the element. The elements are ordered in
3105 document order.
3106
3107 next_siblings ($optional_condition)
3108 Return the list of siblings (optionally matching
3109 $optional_condition) following the element. The elements are
3110 ordered in document order.
3111
3112 siblings ($optional_condition)
3113 Return the list of siblings (optionally matching
3114 $optional_condition) of the element (excluding the element itself).
3115 The elements are ordered in document order.
3116
3117 pos ($optional_condition)
3118 Return the position of the element in the children list. The first
3119 child has a position of 1 (as in XPath).
3120
3121 If the $optional_condition is given then only siblings that match
3122 the condition are counted. If the element itself does not match the
3123 condition then 0 is returned.
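The offsets used by child, sibling and pos follow different conventions (0-based, relative and 1-based respectively), which this sketch illustrates on an invented document:

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse('<doc><a>1</a><b>2</b><a>3</a><c>4</c></doc>');
my $root = $twig->root;

my $first    = $root->child(0);         # first child (0-based)
my $last     = $root->child(-1);        # last child
my $second_a = $root->child( 1, 'a' );  # second child matching "a"

my $pos_all = $second_a->pos;           # position among all children (1-based)
my $pos_a   = $second_a->pos('a');      # position among "a" children only

my $b_elt = $root->first_child('b');
my $next  = $b_elt->sibling(1);         # positive offset: next sibling
my $prev  = $b_elt->sibling(-1);        # negative offset: previous sibling
my @after = $b_elt->next_siblings;      # all following siblings

print join( ' ', $first->text, $last->text, $second_a->text ), "\n";
print "$pos_all $pos_a\n";
```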
3124
3125 atts
3126 Return a hash ref containing the element attributes
3127
3128 set_atts ({ att1=>$att1_val, att2=> $att2_val... })
3129 Set the element attributes with the hash ref supplied as the
3130 argument. The previous attributes are lost (ie the attributes set
3131 by "set_atts" replace all of the attributes of the element).
3132
3133 You can also pass a list instead of a hashref: "$elt->set_atts(
3134 att1 => 'val1',...)"
3135
3136 del_atts
3137 Deletes all the element attributes.
3138
3139 att_nb
3140 Return the number of attributes for the element
3141
3142 has_atts
3143 Return true if the element has attributes (in fact return the
3144 number of attributes, thus being an alias to "att_nb")
3145
3146 has_no_atts
3147 Return true if the element has no attributes, false (0) otherwise
3148
3149 att_names
3150 Return a list of the attribute names for the element
3151
3152 att_xml_string ($att, $options)
3153 Return the attribute value, where '&', '<' and quote (" or the
3154 value of the quote option at twig creation) are XML-escaped.
3155
3156 The options are passed as a hashref, setting "escape_gt" to a true
3157 value will also escape '>': $elt->att_xml_string( 'myatt', { escape_gt => 1 });
3158
3159 set_id ($id)
3160 Set the "id" attribute of the element to the value. See the "id"
3161 option of XML::Twig "new" to change the id attribute name
3162
3163 id Gets the id attribute value
3164
3165 del_id ($id)
3166 Deletes the "id" attribute of the element and remove it from the id
3167 list for the document
3168
3169 class
3170 Return the "class" attribute for the element (methods on the
3171 "class" attribute are quite convenient when dealing with XHTML, or
3172 plain XML that will eventually be displayed using CSS)
3173
3174 lclass
3175 same as class, except that this method is an lvalue, so you can do
3176 "$elt->lclass= "foo""
3177
3178 set_class ($class)
3179 Set the "class" attribute for the element to $class
3180
3181 add_class ($class)
3182 Add $class to the element "class" attribute: the new class is added
3183 only if it is not already present.
3184
3185 Note that classes are then sorted alphabetically, so the "class"
3186 attribute can be changed even if the class is already there
3187
3188 remove_class ($class)
3189 Remove $class from the element "class" attribute.
3190
3191 Note that classes are then sorted alphabetically, so the "class"
3192 attribute can be changed even if the class is already there
3193
3194 add_to_class ($class)
3195 alias for add_class
3196
3197 att_to_class ($att)
3198 Set the "class" attribute to the value of attribute $att
3199
3200 add_att_to_class ($att)
3201 Add the value of attribute $att to the "class" attribute of the
3202 element
3203
3204 move_att_to_class ($att)
3205 Add the value of attribute $att to the "class" attribute of the
3206 element and delete the attribute
3207
3208 tag_to_class
3209 Set the "class" attribute of the element to the element tag
3210
3211 add_tag_to_class
3212 Add the element tag to its "class" attribute
3213
3214 set_tag_class ($new_tag)
3215 Add the element tag to its "class" attribute and sets the tag to
3216 $new_tag
3217
3218 in_class ($class)
3219 Return true (1) if the element is in the class $class (if $class is
3220 one of the tokens in the element "class" attribute)
3221
3222 tag_to_span
3223 Change the element tag to "span" and set its class to the old tag
3224
3225 tag_to_div
3226 Change the element tag to "div" and set its class to the old tag
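The class methods can be combined as in the following sketch. The elements are created directly with "new", and the tag and class names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $p = XML::Twig::Elt->new( p => { class => 'note' }, 'text' );

$p->add_class('warning');
my $both = $p->class;                    # classes are kept sorted

my $is_warning = $p->in_class('warning');

$p->remove_class('note');
my $p_out = $p->sprint;

# Turn a custom tag into a span carrying the old tag as its class.
my $w = XML::Twig::Elt->new( warning => 'careful' );
$w->tag_to_span;
my $w_out = $w->sprint;

print "$both\n$p_out\n$w_out\n";
```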
3227
3228 DESTROY
3229 Frees the element from memory.
3230
3231 start_tag
3232 Return the string for the start tag for the element, including the
3233 "/>" at the end of an empty element tag
3234
3235 end_tag
3236 Return the string for the end tag of an element. For an empty
3237 element, this returns the empty string ('').
3238
3239 xml_string @optional_options
3240 Equivalent to "$elt->sprint( 1)", returns the string for the entire
3241 element, excluding the element's tags (but nested element tags are
3242 present)
3243
3244 The '"no_recurse"' option will only return the text of the element,
3245 not of any included sub-elements (same as "xml_text_only").
3246
3247 inner_xml
3248 Another synonym for xml_string
3249
3250 outer_xml
3251 Another synonym for sprint
3252
3253 xml_text
3254 Return the text of the element, encoded (and processed by the
3255 current "output_filter" or "output_encoding" options), without any
3256 tag.
3257
3258 xml_text_only
3259 Same as "xml_text" except that the text returned doesn't include
3260 the text of sub-elements.
3261
3262 set_pretty_print ($style)
3263 Set the pretty print method, amongst '"none"' (default),
3264 '"nsgmls"', '"nice"', '"indented"', '"record"' and '"record_c"'
3265
3266 pretty_print styles:
3267
3268 none
3269 the default, no "\n" is used
3270
3271 nsgmls
3272 nsgmls style, with "\n" added within tags
3273
3274 nice
3275 adds "\n" wherever possible (NOT SAFE, can lead to invalid XML)
3276
3277 indented
3278 same as "nice" plus indents elements (NOT SAFE, can lead to
3279 invalid XML)
3280
3281 record
3282 table-oriented pretty print, one field per line
3283
3284 record_c
3285 table-oriented pretty print, more compact than "record", one
3286 record per line
3287
3288 set_empty_tag_style ($style)
3289 Set the method to output empty tags, amongst '"normal"' (default),
3290 '"html"', and '"expand"',
3291
3292 "normal" outputs an empty tag '"<tag/>"', "html" adds a space
3293 '"<tag />"' for elements that can be empty in XHTML and "expand"
3294 outputs '"<tag></tag>"'
3295
3296 set_remove_cdata ($flag)
3297 set (or unset) the flag that forces the twig to output CDATA
3298 sections as regular (escaped) PCDATA
3299
3300 set_indent ($string)
3301 Set the indentation for the indented pretty print style (default is
3302 2 spaces)
3303
3304 set_quote ($quote)
3305 Set the quotes used for attributes. can be '"double"' (default) or
3306 '"single"'
3307
3308 cmp ($elt)
3309 Compare the order of the 2 elements in a twig.
3310
3311 $a is the <A>...</A> element, $b is the <B>...</B> element
3312
3313 document                              $a->cmp( $b)
3314 <A> ... </A> ... <B> ... </B>         -1
3315 <A> ... <B> ... </B> ... </A>         -1
3316 <B> ... </B> ... <A> ... </A>          1
3317 <B> ... <A> ... </A> ... </B>          1
3318 $a == $b                               0
3319 $a and $b not in the same tree         undef
3320
3321 before ($elt)
3322 Return 1 if the element starts before $elt, 0 otherwise. If the 2
3323 elements are not in the same twig then return "undef".
3324
3325 if( $a->cmp( $b) == -1) { return 1; } else { return 0; }
3326
3327 after ($elt)
3328 Return 1 if the element starts after $elt, 0 otherwise. If the 2
3329 elements are not in the same twig then return "undef".
3330
3331 if( $a->cmp( $b) == 1) { return 1; } else { return 0; }
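The comparison methods can be checked on a tiny tree, as in this sketch. The document and variable names are invented for illustration; here the element on which the method is called plays the role of $a in the table for cmp.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig   = XML::Twig->new->parse('<doc><a><c/></a><b/></doc>');
my $first  = $twig->root->first_child('a');
my $second = $twig->root->first_child('b');
my $inner  = $first->first_child('c');

my $a_vs_b = $first->cmp($second);    # <a> starts before <b>
my $b_vs_a = $second->cmp($first);    # <b> starts after <a>
my $a_vs_c = $first->cmp($inner);     # <a> contains <c>, so it starts first
my $same   = $first->cmp($first);     # same element

my $is_before = $first->before($second);
my $is_after  = $second->after($first);

print "$a_vs_b $b_vs_a $a_vs_c $same\n";
```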
3332
3333 other comparison methods
3334 lt
3335 le
3336 gt
3337 ge
3338 path
3339 Return the element context in a form similar to XPath's short form:
3340 '"/root/tag1/../tag"'
3341
3342 xpath
3343 Return a unique XPath expression that can be used to find the
3344 element again.
3345
3346 It looks like "/doc/sect[3]/title": unique elements do not have an
3347 index, the others do.
3348
3349 flush
3350 Flush the twig up to the current element (strictly equivalent to
3351 "$elt->root->flush")
3352
3353 private methods
3354 Low-level methods on the twig:
3355
3356 set_parent ($parent)
3357 set_first_child ($first_child)
3358 set_last_child ($last_child)
3359 set_prev_sibling ($prev_sibling)
3360 set_next_sibling ($next_sibling)
3361 set_twig_current
3362 del_twig_current
3363 twig_current
3364 contains_text
3365
3366 Those methods should not be used, unless of course you find some
3367 creative and interesting, not to mention useful, ways to do it.
3368
3369 cond
3370 Most of the navigation functions accept a condition as an optional
3371 argument. The first element (or all elements for "children" or
3372 "ancestors") that passes the condition is returned.
3373
3374 The condition is a single step of an XPath expression using the XPath
3375 subset defined by "get_xpath". Additional conditions are:
3376
3379 #ELT
3380 return a "real" element (not a PCDATA, CDATA, comment or pi
3381 element)
3382
3383 #TEXT
3384 return a PCDATA or CDATA element
3385
3386 regular expression
3387 return an element whose tag matches the regexp. The regexp has to
3388 be created with "qr//" (hence this is available only on perl 5.005
3389 and above)
3390
3391 code reference
3392 applies the code, passing the current element as argument, if the
3393 code returns true then the element is returned, if it returns false
3394 then the code is applied to the next candidate.
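The kinds of condition listed above can all be passed to the same navigation method, as this sketch shows. The document and the "n" attribute are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new->parse(
    '<doc>intro<h1>Title</h1><p n="1">a</p><p n="2">b</p></doc>');
my $root = $twig->root;

my $by_xpath  = $root->first_child('p[@n="2"]');   # XPath-like step
my @real_elts = $root->children('#ELT');           # skips the text node
my $text_node = $root->first_child('#TEXT');       # the "intro" PCDATA
my $heading   = $root->first_child( qr/^h\d$/ );   # regexp on the tag

# Code reference: receives each candidate element as its first argument.
my $by_code = $root->first_child(
    sub { $_[0]->is_elt && ( $_[0]->att('n') // '' ) eq '2' } );

print scalar(@real_elts), " ", $heading->tag, " ", $by_code->text, "\n";
```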
3395
3396 XML::Twig::XPath
3397 XML::Twig implements a subset of XPath through the "get_xpath" method.
3398
3399 If you want to use the whole XPath power, then you can use
3400 "XML::Twig::XPath" instead. In this case "XML::Twig" uses "XML::XPath"
3401 to execute XPath queries. You will of course need "XML::XPath"
3402 installed to be able to use "XML::Twig::XPath".
3403
3404 See XML::XPath for more information.
3405
3406 The methods you can use are:
3407
3408 findnodes ($path)
3409 return a list of nodes found by $path.
3410
3411 findnodes_as_string ($path)
3412 return the nodes found reproduced as XML. The result is not
3413 guaranteed to be valid XML though.
3414
3415 findvalue ($path)
3416 return the concatenation of the text content of the result nodes
3417
3418 In order for "XML::XPath" to be used as the XPath engine the following
3419 methods are included in "XML::Twig":
3420
3421 in XML::Twig
3422
3423 getRootNode
3424 getParentNode
3425 getChildNodes
3426
3427 in XML::Twig::Elt
3428
3429 string_value
3430 toString
3431 getName
3432 getRootNode
3433 getNextSibling
3434 getPreviousSibling
3435 isElementNode
3436 isTextNode
3437 isPI
3438 isPINode
3439 isProcessingInstructionNode
3440 isComment
3441 isCommentNode
3442 getTarget
3443 getChildNodes
3444 getElementById
3445
3446 XML::Twig::XPath::Elt
3447 The methods you can use are the same as on "XML::Twig::XPath" elements:
3448
3449 findnodes ($path)
3450 return a list of nodes found by $path.
3451
3452 findnodes_as_string ($path)
3453 return the nodes found reproduced as XML. The result is not
3454 guaranteed to be valid XML though.
3455
3456 findvalue ($path)
3457 return the concatenation of the text content of the result nodes
3458
3459 XML::Twig::Entity_list
3460 new Create an entity list.
3461
3462 add ($ent)
3463 Add an entity to an entity list.
3464
3465 add_new_ent ($name, $val, $sysid, $pubid, $ndata, $param)
3466 Create a new entity and add it to the entity list
3467
delete ($ent or $tag)
3469 Delete an entity (defined by its name or by the Entity object) from
3470 the list.
3471
3472 print ($optional_filehandle)
3473 Print the entity list.
3474
3475 list
3476 Return the list as an array
3477
3478 XML::Twig::Entity
3479 new ($name, $val, $sysid, $pubid, $ndata, $param)
3480 Same arguments as the Entity handler for XML::Parser.
3481
3482 print ($optional_filehandle)
3483 Print an entity declaration.
3484
3485 name
3486 Return the name of the entity
3487
3488 val Return the value of the entity
3489
3490 sysid
3491 Return the system id for the entity (for NDATA entities)
3492
3493 pubid
3494 Return the public id for the entity (for NDATA entities)
3495
3496 ndata
3497 Return true if the entity is an NDATA entity
3498
3499 param
3500 Return true if the entity is a parameter entity
3501
3502 text
3503 Return the entity declaration text.
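
As a rough sketch of how the entity classes fit together, the following builds an entity list by hand using only the methods listed above. The entity names, values and the system id are invented for the example; in a real program the list would normally come from a parsed document rather than be built from scratch.

```perl
use strict;
use warnings;
use XML::Twig;

# build a list by hand (names and values are hypothetical)
my $list = XML::Twig::Entity_list->new;

# arguments are ($name, $val, $sysid, $pubid, $ndata, $param)
$list->add_new_ent( 'holder', 'Example Corp');
$list->add( XML::Twig::Entity->new( 'logo', undef, 'logo.png', undef, 'png'));

# index the entities by name
my %by_name = map { $_->name => $_ } $list->list;

print "holder is '", $by_name{holder}->val, "'\n";
print "logo is an NDATA entity\n" if $by_name{logo}->ndata;
print "declaration: ", $by_name{holder}->text, "\n";

# entities can be removed by name (or by passing the Entity object)
$list->delete( 'logo');
my @remaining = map { $_->name } $list->list;
print "remaining: @remaining\n";
```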
3504
Additional examples (and a complete tutorial) can be found on the
XML::Twig page <http://www.xmltwig.org/xmltwig/>

To figure out what flush does, call the following script with an XML
file and an element name as arguments:
3511
    use strict;
    use warnings;
    use XML::Twig;

    # usage: script.pl file.xml element_name
    my ($file, $elt) = @ARGV;
    my $t = XML::Twig->new( twig_handlers =>
        { $elt => sub { $_[0]->flush; print "\n[flushed here]\n"; } });
    $t->parsefile( $file, ErrorContext => 2);
    $t->flush;    # output whatever is left after the last handler call
    print "\n";
3520
3522 Subclassing XML::Twig
3523 Useful methods:
3524
3525 elt_class
In order to subclass "XML::Twig" you will probably also need to
subclass "XML::Twig::Elt". Use the "elt_class" option when you create
the "XML::Twig" object to get the elements created in a different
class (which should be a subclass of "XML::Twig::Elt").
3530
3531 add_options
If you inherit the "XML::Twig" new method but want to add more
options to it, you can use this method to prevent XML::Twig from
issuing warnings for those additional options.
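
A minimal sketch of the "elt_class" option, assuming a hypothetical subclass named "My::Elt" that adds one convenience method ("shout" is made up for the example and is not part of XML::Twig):

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical element subclass adding a single method
package My::Elt;
our @ISA = ( 'XML::Twig::Elt');

sub shout { my $elt = shift; return uc $elt->text; }

package main;

# elements of this twig are created in My::Elt instead of XML::Twig::Elt
my $twig = XML::Twig->new( elt_class => 'My::Elt');
$twig->parse( '<doc><p>hi</p></doc>');

my $p = $twig->root->first_child( 'p');
print ref( $p), ": ", $p->shout, "\n";
```

Because the subclass inherits from "XML::Twig::Elt", all the usual navigation and modification methods remain available on the elements.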
3535
3536 DTD Handling
3537 There are 3 possibilities here. They are:
3538
3539 No DTD
3540 No doctype, no DTD information, no entity information, the world is
3541 simple...
3542
3543 Internal DTD
3544 The XML document includes an internal DTD, and maybe entity
3545 declarations.
3546
3547 If you use the load_DTD option when creating the twig the DTD
3548 information and the entity declarations can be accessed.
3549
The DTD and the entity declarations will be "flush"'ed (or
"print"'ed) either as is (if they have not been modified) or as
reconstructed (poorly: comments are lost, order is not kept, and due
to its content this DTD should not be viewed by anyone) if they have
been modified. You can also modify them directly by changing the
"$twig->{twig_doctype}->{internal}" field (straight from
XML::Parser, see the "Doctype" handler doc).
3557
3558 External DTD
3559 The XML document includes a reference to an external DTD, and maybe
3560 entity declarations.
3561
If you use the "load_DTD" option when creating the twig the DTD
information and the entity declarations can be accessed. The entity
declarations will be "flush"'ed (or "print"'ed) either as is (if
they have not been modified) or as reconstructed (badly: comments
are lost, order is not kept).

You can change the doctype through the "$twig->set_doctype" method
and print the DTD through the "$twig->dtd_text" or
"$twig->dtd_print" methods.
3572
3573 If you need to modify the entity list this is probably the easiest
3574 way to do it.
3575
3576 Flush
Remember that element handlers are called when the element is CLOSED,
so if you have handlers for nested elements the inner handlers will be
called first. This makes it trickier than it would seem to number
nested sections (or clauses, or divs), for example, as the titles in
the inner sections are handled before those of the outer sections.
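
The calling order can be checked with a small sketch like the one below. The "section" tag and the "id" values are made up for the illustration; the handler simply records the order in which it is called.

```perl
use strict;
use warnings;
use XML::Twig;

my @order;    # ids in the order the handler sees them

my $twig = XML::Twig->new(
    twig_handlers => { section => sub { push @order, $_->att( 'id'); } },
);

$twig->parse( '<doc>'
            . '<section id="1"><section id="1.1"/><section id="1.2"/></section>'
            . '<section id="2"/>'
            . '</doc>');

# the nested sections are closed, and thus handled, before their parent
print join( ', ', @order), "\n";    # 1.1, 1.2, 1, 2
```

One way around this when numbering sections is to do the numbering in the handler of an enclosing element, once all of its nested sections are available.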
3582
3584 segfault during parsing
3585 This happens when parsing huge documents, or lots of small ones,
3586 with a version of Perl before 5.16.
3587
3588 This is due to a bug in the way weak references are handled in Perl
3589 itself.
3590
3591 The fix is either to upgrade to Perl 5.16 or later ("perlbrew" is a
3592 great tool to manage several installations of perl on the same
3593 machine).
3594
Another, NOT RECOMMENDED, way of fixing the problem is to switch
off weak references by writing "XML::Twig::_set_weakrefs( 0);" at
the top of the code. This is totally unsupported, and may lead to
other problems though.
3599
3600 entity handling
3601 Due to XML::Parser behaviour, non-base entities in attribute values
3602 disappear if they are not declared in the document:
3603 "att="val&ent;"" will be turned into "att => val", unless you use
3604 the "keep_encoding" argument to "XML::Twig->new"
3605
3606 DTD handling
The DTD handling methods are quite buggy. No one uses them and it
seems very difficult to get them to work in all cases, including
with several slightly incompatible versions of XML::Parser and of
libexpat.

Basically you can read the DTD, output it back properly, and update
entities, but not much more.

So use XML::Twig with standalone documents, or with documents
referring to an external DTD, but don't expect it to properly parse
and even output back the DTD.
3618
3619 memory leak
If you use a REALLY old Perl (5.005!) and a lot of twigs you might
find that you leak quite a lot of memory (about 2 KB per twig). You
can use the "dispose" method to free that memory after you are
done.
3624
3625 If you create elements the same thing might happen, use the
3626 "delete" method to get rid of them.
3627
3628 Alternatively installing the "Scalar::Util" (or "WeakRef") module
3629 on a version of Perl that supports it (>5.6.0) will get rid of the
3630 memory leaks automagically.
3631
3632 ID list
3633 The ID list is NOT updated when elements are cut or deleted.
3634
3635 change_gi
3636 This method will not function properly if you do:
3637
3638 $twig->change_gi( $old1, $new);
3639 $twig->change_gi( $old2, $new);
3640 $twig->change_gi( $new, $even_newer);
3641
3642 sanity check on XML::Parser method calls
3643 XML::Twig should really prevent calls to some XML::Parser methods,
3644 especially the "setHandlers" method.
3645
3646 pretty printing
3647 Pretty printing (at least using the '"indented"' style) is hard to
3648 get right! Only elements that belong to the document will be
3649 properly indented. Printing elements that do not belong to the twig
3650 makes it impossible for XML::Twig to figure out their depth, and
3651 thus their indentation level.
3652
3653 Also there is an unavoidable bug when using "flush" and pretty
3654 printing for elements with mixed content that start with an
3655 embedded element:
3656
3657 <elt><b>b</b>toto<b>bold</b></elt>
3658
3659 will be output as
3660
3661 <elt>
3662 <b>b</b>toto<b>bold</b></elt>
3663
3664 if you flush the twig when you find the "<b>" element
3665
These are the things that can mess up calling code, especially if
threaded. They might also cause problems under mod_perl.
3669
3670 Exported constants
Whether you want them or not you get them! These are subroutines to
use as constants when creating or testing elements:
3673
3674 PCDATA return '#PCDATA'
3675 CDATA return '#CDATA'
3676 PI return '#PI', I had the choice between PROC and PI :--(
3677
3678 Module scoped values: constants
3679 these should cause no trouble:
3680
    %base_ent= ( '>' => '&gt;',
                 '<' => '&lt;',
                 '&' => '&amp;',
                 "'" => '&apos;',
                 '"' => '&quot;',
               );
3687 CDATA_START = "<![CDATA[";
3688 CDATA_END = "]]>";
3689 PI_START = "<?";
3690 PI_END = "?>";
3691 COMMENT_START = "<!--";
3692 COMMENT_END = "-->";
3693
3694 pretty print styles
3695
3696 ( $NSGMLS, $NICE, $INDENTED, $INDENTED_C, $WRAPPED, $RECORD1, $RECORD2)= (1..7);
3697
3698 empty tag output style
3699
3700 ( $HTML, $EXPAND)= (1..2);
3701
3702 Module scoped values: might be changed
3703 Most of these deal with pretty printing, so the worst that can
3704 happen is probably that XML output does not look right, but is
3705 still valid and processed identically by XML processors.
3706
$empty_tag_style can mess up HTML browsers though, and changing $ID
would most likely create problems.
3709
3710 $pretty=0; # pretty print style
3711 $quote='"'; # quote for attributes
3712 $INDENT= ' '; # indent for indented pretty print
3713 $empty_tag_style= 0; # how to display empty tags
3714 $ID # attribute used as an id ('id' by default)
3715
3716 Module scoped values: definitely changed
3717 These 2 variables are used to replace tags by an index, thus saving
3718 some space when creating a twig. If they really cause you too much
3719 trouble, let me know, it is probably possible to create either a
3720 switch or at least a version of XML::Twig that does not perform
3721 this optimization.
3722
3723 %gi2index; # tag => index
3724 @index2gi; # list of tags
3725
3726 If you need to manipulate all those values, you can use the following
3727 methods on the XML::Twig object:
3728
3729 global_state
3730 Return a hashref with all the global variables used by XML::Twig
3731
3732 The hash has the following fields: "pretty", "quote", "indent",
3733 "empty_tag_style", "keep_encoding", "expand_external_entities",
3734 "output_filter", "output_text_filter", "keep_atts_order"
3735
3736 set_global_state ($state)
3737 Set the global state, $state is a hashref
3738
3739 save_global_state
3740 Save the current global state
3741
3742 restore_global_state
Restore the previously saved (using "save_global_state") state
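
A small sketch of how these methods might be combined to protect calling code from a temporary change of pretty print style. It assumes that the "pretty" field of the hash returned by "global_state" holds the numeric style listed above, and it uses "set_pretty_print" only as an example of something that changes the global state.

```perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig->new;

my $before = $twig->global_state->{pretty};    # style in effect at the start

$twig->save_global_state;                      # remember the current globals
$twig->set_pretty_print( 'indented');          # temporary change
my $during = $twig->global_state->{pretty};

$twig->restore_global_state;                   # back to the saved values
my $after = $twig->global_state->{pretty};

print "pretty print style: before=$before, during=$during, after=$after\n";
```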
3744
3746 SAX handlers
3747 Allowing XML::Twig to work on top of any SAX parser
3748
3749 multiple twigs are not well supported
3750 A number of twig features are just global at the moment. These
3751 include the ID list and the "tag pool" (if you use "change_gi" then
3752 you change the tag for ALL twigs).
3753
A future version will try to support this while trying not to be too
hard on performance (at least when a single twig is used!).
3756
3758 Michel Rodriguez <mirod@cpan.org>
3759
3761 This library is free software; you can redistribute it and/or modify it
3762 under the same terms as Perl itself.
3763
3764 Bug reports should be sent using: RT
3765 <http://rt.cpan.org/NoAuth/Bugs.html?Dist=XML-Twig>
3766
3767 Comments can be sent to mirod@cpan.org
3768
The XML::Twig page is at <http://www.xmltwig.org/xmltwig/>. It includes
the development version of the module, a slightly better version of the
documentation, examples, and a tutorial, "Processing XML efficiently
with Perl and XML::Twig":
<http://www.xmltwig.org/xmltwig/tutorial/index.html>
3774
3776 Complete docs, including a tutorial, examples, an easier to use HTML
3777 version of the docs, a quick reference card and a FAQ are available at
3778 <http://www.xmltwig.org/xmltwig/>
3779
3780 git repository at <http://github.com/mirod/xmltwig>
3781
XML::Parser, XML::Parser::Expat, XML::XPath, Encode, Text::Iconv,
Scalar::Util
3784
3785 Alternative Modules
XML::Twig is not the only XML processing module available on CPAN (far
from it!).
3788
3789 The main alternative I would recommend is XML::LibXML.
3790
3791 Here is a quick comparison of the 2 modules:
3792
3793 XML::LibXML, actually "libxml2" on which it is based, sticks to the
3794 standards, and implements a good number of them in a rather strict way:
3795 XML, XPath, DOM, RelaxNG, I must be forgetting a couple (XInclude?). It
3796 is fast and rather frugal memory-wise.
3797
3798 XML::Twig is older: when I started writing it XML::Parser/expat was the
3799 only game in town. It implements XML and that's about it (plus a subset
3800 of XPath, and you can use XML::Twig::XPath if you have XML::XPathEngine
3801 installed for full support). It is slower and requires more memory for
a full tree than XML::LibXML. On the plus side (yes, there is a plus
side!) it lets you process a big document in chunks, and thus lets you
tackle documents that couldn't be loaded in memory by XML::LibXML, and
it offers a lot (and I mean a LOT!) of higher-level methods, for
everything, from adding structure to "low-level" XML, to shortcuts for
XHTML conversions and more. It also DWIMs quite a bit, getting comments
and non-significant whitespace out of the way but preserving them in
the output for example. As it does not stick to the DOM, it also
usually leads to shorter code than in XML::LibXML.
3811
Beyond the pure features of the 2 modules, XML::LibXML seems to be
preferred by "XML-purists", while XML::Twig seems to be more used by
3814 Perl Hackers who have to deal with XML. As you have noted, XML::Twig
3815 also comes with quite a lot of docs, but I am sure if you ask for help
3816 about XML::LibXML here or on Perlmonks you will get answers.
3817
3818 Note that it is actually quite hard for me to compare the 2 modules: on
3819 one hand I know XML::Twig inside-out and I can get it to do pretty much
3820 anything I need to (or I improve it ;--), while I have a very basic
3821 knowledge of XML::LibXML. So feature-wise, I'd rather use XML::Twig
3822 ;--). On the other hand, I am painfully aware of some of the
3823 deficiencies, potential bugs and plain ugly code that lurk in
XML::Twig, even though you are unlikely to be affected by them (unless
for example you need to change the DTD of a document programmatically),
while I haven't looked much into XML::LibXML so it still looks shiny
and clean to me.
3828
That said, if you need to process a document that is too big to fit in
memory and XML::Twig is too slow for you, my reluctant advice would be
3831 to use "bare" XML::Parser. It won't be as easy to use as XML::Twig:
3832 basically with XML::Twig you trade some speed (depending on what you do
3833 from a factor 3 to... none) for ease-of-use, but it will be easier IMHO
3834 than using SAX (albeit not standard), and at this point a LOT faster
3835 (see the last test in
3836 <http://www.xmltwig.org/article/simple_benchmark/>).
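
For comparison, here is a minimal sketch of what "bare" XML::Parser processing looks like: a Start handler that counts elements by tag as they stream by, without building any tree. The sample document and the %count hash are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my %count;    # tag => number of start tags seen

my $parser = XML::Parser->new(
    Handlers => {
        # called for every start tag with the expat object, the tag
        # name, and the attributes as a flat name/value list
        Start => sub { my( $expat, $tag, %atts) = @_; $count{$tag}++; },
    },
);

$parser->parse( '<doc><p>one</p><p>two</p><note/></doc>');

print "$_: $count{$_}\n" for sort keys %count;
```

Nothing is kept in memory between handler calls here, which is where the speed comes from; any context (parent element, accumulated text) has to be tracked by hand, which is the work XML::Twig does for you.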
3837
3848
3849
3850
3851perl v5.16.3 2014-06-09 Twig(3)