XML::LibXML::Parser(3)  User Contributed Perl Documentation  XML::LibXML::Parser(3)

NAME

       XML::LibXML::Parser - Parsing XML Data with XML::LibXML

SYNOPSIS

         $parser = XML::LibXML->new();
         $doc = $parser->parse_file( $xmlfilename );
         $doc = $parser->parse_fh( $io_fh );
         $doc = $parser->parse_string( $xmlstring );
         $doc = $parser->parse_html_file( $htmlfile, \%opts );
         $doc = $parser->parse_html_fh( $io_fh, \%opts );
         $doc = $parser->parse_html_string( $htmlstring, \%opts );
         $fragment = $parser->parse_balanced_chunk( $wbxmlstring );
         $fragment = $parser->parse_xml_chunk( $wbxmlstring );
         $parser->process_xincludes( $doc );
         $parser->processXIncludes( $doc );
         $parser->parse_chunk($string, $terminate);
         $parser->start_push();
         $parser->push(@data);
         $doc = $parser->finish_push( $recover );
         $parser->validation(1);
         $parser->recover(1);
         $parser->recover_silently(1);
         $parser->expand_entities(0);
         $parser->keep_blanks(0);
         $parser->pedantic_parser(1);
         $parser->line_numbers(1);
         $parser->load_ext_dtd(1);
         $parser->complete_attributes(1);
         $parser->expand_xinclude(1);
         $parser->load_catalog( $catalog_file );
         $parser->base_uri( $your_base_uri );
         $parser->gdome_dom(1);
         $parser->clean_namespaces( 1 );

DESCRIPTION

SYNOPSIS

         use XML::LibXML;
         my $parser = XML::LibXML->new();

         my $doc = $parser->parse_string(<<'EOT');
         <some-xml/>
         EOT
         my $fdoc = $parser->parse_file( $xmlfile );

         my $fhdoc = $parser->parse_fh( $xmlstream );

         my $fragment = $parser->parse_balanced_chunk( $xml_wb_chunk );

PARSING

       An XML document is read into a data structure such as a DOM tree by a
       piece of software called a parser. XML::LibXML currently provides four
       different parser interfaces:

       ·   A DOM Pull-Parser

       ·   A DOM Push-Parser

       ·   A SAX Parser

       ·   A DOM based SAX Parser.

       Creating a Parser Instance

       XML::LibXML provides an OO interface to the libxml2 parser functions.
       Thus you have to create a parser instance before you can parse any XML
       data.

       new
             $parser = XML::LibXML->new();

           There is not much to say about the constructor. It simply creates
           a new parser instance.

           Although libxml2 mainly uses global flags to alter the behaviour of
           the parser, each XML::LibXML parser instance has its own flags and
           callbacks and does not interfere with other instances.

       DOM Parser

       One of the common parser interfaces of XML::LibXML is the DOM parser.
       This parser reads XML data into a DOM-like data structure, so each tag
       can be accessed and transformed.

       XML::LibXML's DOM parser is capable of parsing not only XML data but
       also (strict) HTML files. There are three ways to parse documents: as
       a string, as a Perl filehandle, or as a filename/URL. The return value
       from each is an XML::LibXML::Document object, which is a DOM object.

       All of the functions listed below will throw an exception if the
       document is invalid. To prevent this from terminating your program,
       wrap the call in an eval{} block.

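       A minimal sketch of the eval pattern described above. It uses a
       deliberately broken string so that it runs without an input file; the
       variable names are illustrative, and the wording of the error message
       depends on the installed libxml2.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Deliberately broken input: <item> is never closed.
my $broken = '<root><item></root>';

# eval returns undef if the parser throws; the error is left in $@.
my $doc = eval { $parser->parse_string($broken) };

if ( defined $doc ) {
    print "parsed: ", $doc->documentElement->nodeName, "\n";
}
else {
    print "parse failed: $@\n";
}
```
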
       parse_file
             $doc = $parser->parse_file( $xmlfilename );

           This function parses an XML document from a file or network;
           $xmlfilename can be either a filename or a URL. Note that for
           parsing files, this function is the fastest choice, about 6-8
           times faster than parse_fh().

       parse_fh
             $doc = $parser->parse_fh( $io_fh );

           parse_fh() parses an IOREF or a subclass of IO::Handle.

           Because the data comes from an open handle, libxml2's parser does
           not know the base URI of the document. To set the base URI, call
           parse_fh() as follows:

             my $doc = $parser->parse_fh( $io_fh, $baseuri );

       parse_string
             $doc = $parser->parse_string( $xmlstring );

           This function is similar to parse_fh(), but it parses an XML
           document that is available as a single string in memory. Again,
           you can pass an optional base URI to the function.

             my $doc = $parser->parse_string( $xmlstring, $baseuri );

       parse_html_file
             $doc = $parser->parse_html_file( $htmlfile, \%opts );

           Similar to parse_file() but parses HTML (strict) documents;
           $htmlfile can be a filename or URL.

           An optional second argument can be used to pass some options to the
           HTML parser as a HASH reference. Possible options are: encoding and
           URI for libxml2 < 2.6.27, and for later versions of libxml2
           additionally: recover, suppress_errors, suppress_warnings,
           pedantic_parser, no_blanks, and no_network.

       parse_html_fh
             $doc = $parser->parse_html_fh( $io_fh, \%opts );

           Similar to parse_fh() but parses HTML (strict) streams.

           An optional second argument can be used to pass some options to the
           HTML parser as a HASH reference. Possible options are: encoding and
           URI for libxml2 < 2.6.27, and for later versions of libxml2
           additionally: recover, suppress_errors, suppress_warnings,
           pedantic_parser, no_blanks, and no_network.  Note: the encoding
           option may not work correctly with this function in libxml2 <
           2.6.27 if the HTML file declares its charset using a META tag.

       parse_html_string
             $doc = $parser->parse_html_string( $htmlstring, \%opts );

           Similar to parse_string() but parses HTML (strict) strings.

           An optional second argument can be used to pass some options to the
           HTML parser as a HASH reference. Possible options are: encoding and
           URI for libxml2 < 2.6.27, and for later versions of libxml2
           additionally: recover, suppress_errors, suppress_warnings,
           pedantic_parser, no_blanks, and no_network.

       Parsing HTML may cause problems, especially if the ampersand ('&') is
       used.  This is a common problem if the HTML code being parsed contains
       links to CGI scripts. Such links cause the parser to throw errors. In
       such cases libxml2 still parses the entire document as if there were
       no error, but the error causes XML::LibXML to stop the parsing
       process. However, the document is not lost.  Such HTML documents
       should be parsed using the recover flag. By default recovering is
       deactivated.

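       As a sketch of the ampersand problem, the snippet below parses a link
       containing an unescaped '&' and passes the recover and suppress_errors
       options listed above, which the text says require libxml2 2.6.27 or
       later. The HTML string is made up for illustration, and the call is
       wrapped in eval in case the installed version behaves differently.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Hypothetical HTML with an unescaped ampersand in a CGI-style link.
my $html = '<html><body>'
         . '<a href="search.cgi?q=perl&page=2">search</a>'
         . '</body></html>';

# recover and suppress_errors are among the options listed for
# libxml2 >= 2.6.27; the eval guards against versions that reject them.
my $doc = eval {
    $parser->parse_html_string( $html, { recover => 1, suppress_errors => 1 } );
};

if ( defined $doc ) {
    my ($link) = $doc->findnodes('//a');
    print "link text: ", $link->textContent, "\n" if $link;
}
else {
    print "HTML parsing failed: $@\n";
}
```
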
       The functions described above are implemented to parse well formed
       documents.  In some cases a program gets well balanced XML instead of
       well formed documents (e.g. an XML fragment from a database). With
       XML::LibXML it is not required to wrap such fragments in the code,
       because XML::LibXML is capable of parsing even well balanced XML
       fragments.

       parse_balanced_chunk
             $fragment = $parser->parse_balanced_chunk( $wbxmlstring );

           This function parses a well balanced XML string into an
           XML::LibXML::DocumentFragment.

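           A short sketch of what "well balanced" means in practice: the
           chunk below has two sibling elements and no single root, so it is
           not a well formed document, yet it can be parsed into a fragment.
           The sample data and variable names are assumptions made for this
           example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Two sibling elements with no single root: well balanced, not well formed.
my $fragment = $parser->parse_balanced_chunk('<a>one</a><b>two</b>');

# The fragment's children are the top-level nodes of the chunk.
my @children = $fragment->childNodes;
print scalar(@children), " top-level nodes\n";
print $_->nodeName, ": ", $_->textContent, "\n" for @children;
```
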
       parse_xml_chunk
             $fragment = $parser->parse_xml_chunk( $wbxmlstring );

           This is the old name of parse_balanced_chunk(). Because it may
           cause confusion with the push parser interface, this function
           should not be used anymore.

       By default XML::LibXML does not process XInclude tags within an XML
       document (see the options section below). XML::LibXML allows you to
       post process a document to expand XInclude tags.

       process_xincludes
             $parser->process_xincludes( $doc );

           After a document is parsed into a DOM structure, you may want to
           expand the document's XInclude tags. This function processes the
           given document structure and expands all XInclude tags (or throws
           an error) by using the flags and callbacks of the given parser
           instance.

           Note that the resulting tree contains some extra nodes (of type
           XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully
           processing the document. These nodes indicate where data was
           included into the original tree.  If the document is serialized,
           these extra nodes will not show up.

           Remember: A document with processed XIncludes differs from the
           original document after serialization, because the original
           XInclude tags will not get restored!

           If the parser flag "expand_xinclude" is set to 1, you do not need
           to post process the parsed document.

       processXIncludes
             $parser->processXIncludes( $doc );

           This is an alias for process_xincludes, with a Java-like function
           name.

       Push Parser

       XML::LibXML provides a push parser interface. Rather than pulling the
       data from a given source, the push parser waits for the data to be
       pushed into it.

       This allows one to parse large documents without waiting for the parser
       to finish. The interface is especially useful if a program needs to
       preprocess the incoming pieces of XML (e.g. to detect document
       boundaries).

       While the XML::LibXML parse_*() functions require the data to be well
       formed XML, the push parser will take any arbitrary string that
       contains some XML data. The only requirement is that all the pushed
       strings together form a well formed document. With the push parser
       interface a program can interrupt the parsing process as required,
       where the parse_*() functions do not give enough flexibility.

       Unlike the pull parser implemented in parse_fh() or parse_file(), the
       push parser is not able to find out about the document's end itself.
       Thus the calling program needs to indicate explicitly when the parsing
       is done.

       In XML::LibXML this is done by a single function:

       parse_chunk
             $parser->parse_chunk($string, $terminate);

           parse_chunk() tries to parse a given chunk of data, which isn't
           necessarily well balanced data. The function takes two parameters:
           the chunk of data as a string and, optionally, a termination flag.
           If the termination flag is set to a true value (e.g. 1), the
           parsing will be stopped and the resulting document will be
           returned, as the following example shows:

             my $parser = XML::LibXML->new;
             for my $string ( "<", "foo", ' bar="hello world"', "/>") {
                  $parser->parse_chunk( $string );
             }
             my $doc = $parser->parse_chunk("", 1); # terminate the parsing

       Internally XML::LibXML provides three functions that control the push
       parser process:

       start_push
             $parser->start_push();

           Initializes the push parser.

       push
             $parser->push(@data);

           This function pushes the data stored inside the array to libxml2's
           parser. Each entry in @data must be a normal scalar!

       finish_push
             $doc = $parser->finish_push( $recover );

           This function returns the result of the parsing process. If this
           function is called without a parameter it will complain about
           documents that are not well formed. If $recover is 1, the push
           parser can be used to restore broken or not well formed (XML)
           documents. The following example shows the default behaviour:

             eval {
                 $parser->push( "<foo>", "bar" );
                 $doc = $parser->finish_push();    # will report broken XML
             };
             if ( $@ ) {
                # ...
             }

           This can be annoying if the closing tag is missed by accident. The
           following code will restore the document:

             eval {
                 $parser->push( "<foo>", "bar" );
                 $doc = $parser->finish_push(1);   # will return the data parsed
                                                   # unless an error happened
             };

             print $doc->toString(); # returns "<foo>bar</foo>"

           Of course finish_push() will return nothing if no data was pushed
           to the parser before.

       DOM based SAX Parser

       XML::LibXML provides a DOM based SAX parser. The SAX parser is defined
       in XML::LibXML::SAX::Parser. As it is not a stream based parser, it
       parses documents into a DOM and traverses the DOM tree instead.

       The API of this parser is exactly the same as any other Perl SAX2
       parser. See XML::SAX::Intro for details.

       Aside from the regular parsing methods, you can access the DOM tree
       traverser directly, using the generate() method:

         my $doc = build_yourself_a_document();
         my $saxparser = XML::LibXML::SAX::Parser->new( ... );
         $saxparser->generate( $doc );

       This is useful for serializing DOM trees, for example ones that you
       might have done prior processing on, or that you have as a result of
       XSLT processing.

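       The following sketch fills in the generate() call with a concrete
       handler. ElementCounter is a hypothetical class written only for this
       example; it assumes XML::SAX::Base is installed and that the parser
       accepts a Handler argument like other Perl SAX2 parsers.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX::Parser;

{
    # Hypothetical handler: counts start_element events per element name.
    package ElementCounter;
    use base 'XML::SAX::Base';

    sub start_element {
        my ( $self, $element ) = @_;
        $self->{_seen}{ $element->{Name} }++;
    }
}

# Build a small DOM first; generate() then walks it and fires SAX events.
my $doc = XML::LibXML->new->parse_string('<list><item/><item/></list>');

my $handler   = ElementCounter->new;
my $saxparser = XML::LibXML::SAX::Parser->new( Handler => $handler );
$saxparser->generate($doc);

for my $name ( sort keys %{ $handler->{_seen} } ) {
    print "$name: $handler->{_seen}{$name}\n";
}
```
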
       WARNING

       This is NOT a streaming SAX parser. As I said above, this parser reads
       the entire document into a DOM and serialises it. Some people couldn't
       read that in the paragraph above so I've added this warning.

       If you want a streaming SAX parser, look at the XML::LibXML::SAX man
       page.

SERIALIZATION

       XML::LibXML provides some functions to serialize nodes and documents.
       The serialization functions are described on the XML::LibXML::Node
       manpage or the XML::LibXML::Document manpage. XML::LibXML checks three
       global flags that alter the serialization process:

       ·   skipXMLDeclaration

       ·   skipDTD

       ·   setTagCompression

       Of these three flags, only setTagCompression is available for all
       serialization functions.

       Because XML::LibXML does not set these flags itself, one has to define
       them locally as the following example shows:

         local $XML::LibXML::skipXMLDeclaration = 1;
         local $XML::LibXML::skipDTD = 1;
         local $XML::LibXML::setTagCompression = 1;

       If skipXMLDeclaration is defined and not '0', the XML declaration is
       omitted during serialization.

       If skipDTD is defined and not '0', an existing DTD will not be
       serialized with the document.

       If setTagCompression is defined and not '0', empty tags are displayed
       as open and closing tags rather than the shortcut. For example the
       empty tag foo will be rendered as <foo></foo> rather than <foo/>.

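       A small sketch of the setTagCompression flag in use. It serializes the
       same empty element twice, once normally and once with the flag set
       inside a block, so the effect of local is visible. The variable names
       are illustrative.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc  = XML::LibXML->new->parse_string('<foo/>');
my $root = $doc->documentElement;

# Default serialization of the empty element.
my $short = $root->toString;

my $long;
{
    # Because of local, the flag is only set until the end of this block.
    local $XML::LibXML::setTagCompression = 1;
    $long = $root->toString;
}

print "$short\n";
print "$long\n";
```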

PARSER OPTIONS

       LibXML options are global (unfortunately this is a limitation of the
       underlying implementation, not this interface). They can either be set
       using $parser->option(...), or XML::LibXML->option(...); both are
       treated in the same manner. Note that even two parser processes will
       share some of the same options, so be careful out there!

       Every option returns the previous value, and can be called without
       parameters to get the current value.

       validation
             $parser->validation(1);

           Turn validation on (or off). Defaults to off.

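           A hedged sketch of validation against an internal DTD subset. The
           sample document is invented for this example: it is well formed
           but uses an element the DTD does not allow, so parsing is expected
           to throw. load_ext_dtd(1) is set explicitly because this page
           states that flag is required for validation.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->load_ext_dtd(1);    # stated on this page as required for validation
$parser->validation(1);

# Well formed, but <from> is not allowed by the internal DTD subset.
my $xml = join "\n",
    '<?xml version="1.0"?>',
    '<!DOCTYPE note [',
    '  <!ELEMENT note (to)>',
    '  <!ELEMENT to (#PCDATA)>',
    ']>',
    '<note><from>me</from></note>';

# Validation errors are thrown like parse errors, so use eval.
my $doc = eval { $parser->parse_string($xml) };

print defined $doc ? "document is valid\n" : "validation failed: $@\n";
```
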
       recover
             $parser->recover(1);

           Turn the parser's recover mode on (or off). Defaults to off.

           This allows one to parse broken XML data into memory. This switch
           will only work with XML data rather than HTML data. Also the
           validation will be switched off automatically.

           The recover mode helps to recover documents that are almost well
           formed very efficiently. That is for example a document that
           forgets to close the document tag (or any other tag inside the
           document). The recover mode of XML::LibXML has problems restoring
           documents that are more like well balanced chunks.

           XML::LibXML will only parse until the first fatal error occurs,
           reporting recoverable parsing errors as warnings. To suppress these
           warnings use $parser->recover_silently(1); or, equivalently,
           $parser->recover(2).

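           A minimal sketch of recover mode on the kind of input described
           above, a document whose closing document tag is missing. The input
           string is an assumption for illustration, and the call is still
           wrapped in eval in case recovery is not possible.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Recover mode without the warnings on STDERR.
$parser->recover_silently(1);

# The closing </foo> tag is missing.
my $doc = eval { $parser->parse_string('<foo><bar>text</bar>') };

if ( defined $doc ) {
    print "recovered: ", $doc->documentElement->toString, "\n";
}
else {
    print "could not recover: $@\n";
}
```
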
       recover_silently
             $parser->recover_silently(1);

           Turns the parser warnings off (or on). By default, warnings are on.

           This allows you to switch off warnings printed to STDERR when
           parsing documents with recover(1).

           Please note that calling recover_silently(0) also turns the parser
           recover mode off and calling recover_silently(1) automatically
           activates the parser recover mode.

       expand_entities
             $parser->expand_entities(0);

           Turn entity expansion on or off; it is enabled by default. If
           entity expansion is off, any external parsed entities in the
           document are left as entities.  Probably not very useful for most
           purposes.

       keep_blanks
             $parser->keep_blanks(0);

           Allows you to turn off XML::LibXML's default behaviour of
           maintaining whitespace in the document.

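           To illustrate, the sketch below parses the same indented string
           with and without keep_blanks and counts the child nodes of the
           root element. The input is made up for this example; which
           whitespace libxml2 treats as ignorable is decided by the library,
           so treat the counts as an observation for this input only.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = "<a>\n  <b/>\n</a>";

# Default behaviour: whitespace-only text nodes are kept.
my $keeping  = XML::LibXML->new();
my $doc_kept = $keeping->parse_string($xml);
my @kept     = $doc_kept->documentElement->childNodes;

# keep_blanks(0): the parser drops whitespace it considers ignorable.
my $dropping = XML::LibXML->new();
$dropping->keep_blanks(0);
my $doc_stripped = $dropping->parse_string($xml);
my @stripped     = $doc_stripped->documentElement->childNodes;

print "children with blanks kept: ",    scalar(@kept),     "\n";
print "children with blanks dropped: ", scalar(@stripped), "\n";
```
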
       pedantic_parser
             $parser->pedantic_parser(1);

           You can make XML::LibXML more pedantic if you want to.

       line_numbers
             $parser->line_numbers(1);

           If this option is activated, XML::LibXML will store the line number
           of a node.  This gives more information about where a validation
           error occurred. It can also be used to find out the position of a
           node after parsing (see also XML::LibXML::Node::line_number()).

           By default line numbering is switched off (0).

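           A short sketch of looking up node positions after parsing. The
           three-line input string is an assumption chosen so that each
           element starts on its own line.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->line_numbers(1);

# Each element starts on its own line of the input.
my $doc = $parser->parse_string("<a>\n<b/>\n<c/>\n</a>");

# Report the stored line number for every element in the document.
for my $node ( $doc->findnodes('//*') ) {
    printf "%s starts on line %d\n", $node->nodeName, $node->line_number;
}
```
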
       load_ext_dtd
             $parser->load_ext_dtd(1);

           Load external DTD subsets while parsing.

           This flag is also required for DTD validation, to provide complete
           attributes, and to expand entities, regardless of whether the
           document has an internal subset. Thus switching off external DTD
           loading will disable entity expansion, validation, and complete
           attributes on internal subsets as well.

           If you leave this parser flag untouched, everything will work,
           because the default is 1 (activated).

       complete_attributes
             $parser->complete_attributes(1);

           Complete the elements' attribute lists with the ones defaulted from
           the DTDs.  By default, this option is enabled.

       expand_xinclude
             $parser->expand_xinclude(1);

           Expands XInclude tags immediately while parsing the document. This
           flag ensures that the parser callbacks are used while parsing the
           included document.

       load_catalog
             $parser->load_catalog( $catalog_file );

           Will use $catalog_file as a catalog during all parsing processes.
           Using a catalog will significantly speed up parsing processes if
           many external resources are loaded into the parsed documents (such
           as DTDs or XIncludes).

           Note that catalogs will not be available if an external entity
           handler was specified. At present it is not possible to make use
           of both types of resolving systems at the same time.

       base_uri
             $parser->base_uri( $your_base_uri );

           When parsing strings or file handles, XML::LibXML doesn't know the
           base URI of the document. To make relative references such as
           XIncludes work, one has to set a separate base URI, which is then
           used for the parsed documents.

       gdome_dom
             $parser->gdome_dom(1);

           THIS FLAG IS EXPERIMENTAL!

           Although quite powerful, XML::LibXML's DOM implementation is
           limited if one needs or wants full DOM level 2 or level 3 support.
           XML::GDOME is based on libxml2 as well but provides a rather
           complete DOM implementation by wrapping libgdome.  This allows you
           to make use of XML::LibXML's full parser options and XML::GDOME's
           DOM implementation at the same time.

           To make use of this function, one has to install libgdome and
           configure XML::LibXML to use this library. For this you need to
           rebuild XML::LibXML!

       clean_namespaces
             $parser->clean_namespaces( 1 );

           libxml2 2.6.0 and later allow stripping redundant namespace
           declarations from the DOM tree. To do this, one has to set
           clean_namespaces() to 1 (TRUE). By default no namespace cleanup is
           done.

ERROR REPORTING

       XML::LibXML throws exceptions during parsing, validation or XPath
       processing (and on some other occasions). These errors can be caught
       by using eval blocks.  The error will then be stored in $@.

       XML::LibXML throws errors as they occur and does not wait for the user
       to test for them. This is a very common misunderstanding in the use of
       XML::LibXML. If the eval is omitted, XML::LibXML will always halt your
       script by "croaking" (see the Carp man page for details).

       Also note that an increasing number of functions throw errors if bad
       data is passed. If you cannot ensure that valid data is passed to
       XML::LibXML, you should eval these functions.

       Note: since version 1.59, get_last_error() is no longer available in
       XML::LibXML for thread-safety reasons.

AUTHORS

       Matt Sergeant, Christian Glahn, Petr Pajas

VERSION

       1.62

COPYRIGHT

       2001-2006, AxKit.com Ltd; 2002-2006 Christian Glahn; 2006 Petr Pajas,
       All rights reserved.

perl v5.8.8                       2006-11-17            XML::LibXML::Parser(3)