XML::LibXML::Parser(3)    User Contributed Perl Documentation    XML::LibXML::Parser(3)

NAME

XML::LibXML::Parser - Parsing XML Data with XML::LibXML

SYNOPSIS

    $parser = XML::LibXML->new();
    $doc = $parser->parse_file( $xmlfilename );
    $doc = $parser->parse_fh( $io_fh );
    $doc = $parser->parse_string( $xmlstring );
    $doc = $parser->parse_html_file( $htmlfile, \%opts );
    $doc = $parser->parse_html_fh( $io_fh, \%opts );
    $doc = $parser->parse_html_string( $htmlstring, \%opts );
    $fragment = $parser->parse_balanced_chunk( $wbxmlstring );
    $fragment = $parser->parse_xml_chunk( $wbxmlstring );
    $parser->process_xincludes( $doc );
    $parser->processXIncludes( $doc );
    $parser->parse_chunk($string, $terminate);
    $parser->start_push();
    $parser->push(@data);
    $doc = $parser->finish_push( $recover );
    $parser->validation(1);
    $parser->recover(1);
    $parser->recover_silently(1);
    $parser->expand_entities(0);
    $parser->keep_blanks(0);
    $parser->pedantic_parser(1);
    $parser->line_numbers(1);
    $parser->load_ext_dtd(1);
    $parser->complete_attributes(1);
    $parser->expand_xinclude(1);
    $parser->load_catalog( $catalog_file );
    $parser->base_uri( $your_base_uri );
    $parser->gdome_dom(1);
    $parser->clean_namespaces( 1 );

    use XML::LibXML;
    my $parser = XML::LibXML->new();

    my $doc = $parser->parse_string(<<'EOT');
    <some-xml/>
    EOT
    my $fdoc = $parser->parse_file( $xmlfile );

    my $fhdoc = $parser->parse_fh( $xmlstream );

    my $fragment = $parser->parse_xml_chunk( $xml_wb_chunk );

PARSING

An XML document is read into a data structure such as a DOM tree by a
piece of software called a parser. XML::LibXML currently provides four
different parser interfaces:

· A DOM Pull-Parser

· A DOM Push-Parser

· A SAX Parser

· A DOM based SAX Parser.

Creating a Parser Instance

XML::LibXML provides an OO interface to the libxml2 parser functions.
Thus you have to create a parser instance before you can parse any XML
data.

new
    $parser = XML::LibXML->new();

There is nothing much to say about the constructor. It simply creates a
new parser instance.

Although libxml2 uses mainly global flags to alter the behaviour of the
parser, each XML::LibXML parser instance has its own flags or callbacks
and does not interfere with other instances.

DOM Parser

One of the common parser interfaces of XML::LibXML is the DOM parser.
This parser reads XML data into a DOM-like data structure, so each tag
can be accessed and transformed.

XML::LibXML's DOM parser is capable of parsing not only XML data but
also (strict) HTML files. There are three ways to parse documents: as a
string, as a Perl filehandle, or as a filename/URL. The return value
from each is an XML::LibXML::Document object, which is a DOM object.

All of the functions listed below will throw an exception if the
document is invalid (for example, not well formed). To prevent this
from terminating your program, wrap the call in an eval{} block.
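
The eval{} advice above can be sketched as follows. This is a minimal illustration rather than a prescribed pattern; the broken input string and the variable names are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Deliberately broken input (illustrative): <b> is never closed.
my $broken_xml = '<a><b></a>';

# eval returns undef if parse_string() dies; the error is left in $@.
my $doc   = eval { $parser->parse_string($broken_xml) };
my $error = $@;

if ( defined $doc ) {
    print "parsed: ", $doc->documentElement->nodeName, "\n";
}
else {
    print "could not parse input: $error\n";
}
```

Copying $@ into a lexical straight after the eval keeps the message from being overwritten by a later eval.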

parse_file
    $doc = $parser->parse_file( $xmlfilename );

This function parses an XML document from a file or network;
$xmlfilename can be either a filename or a URL. Note that for parsing
files, this function is the fastest choice, about 6-8 times faster than
parse_fh().

parse_fh
    $doc = $parser->parse_fh( $io_fh );

parse_fh() parses an IOREF or a subclass of IO::Handle.

Because the data comes from an open handle, libxml2's parser does not
know about the base URI of the document. To set the base URI one should
use parse_fh() as follows:

    my $doc = $parser->parse_fh( $io_fh, $baseuri );

parse_string
    $doc = $parser->parse_string( $xmlstring );

This function is similar to parse_fh(), but it parses an XML document
that is available as a single string in memory. Again, you can pass an
optional base URI to the function.

    my $doc = $parser->parse_string( $xmlstring, $baseuri );
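
As a small sketch of the optional base URI argument, the following parses an in-memory string and passes a URI alongside it. The URI and the XML content are hypothetical values chosen for the example; the base URI only matters when the document contains relative references.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Both values are illustrative assumptions, not part of the library.
my $xml      = '<doc><title>Example</title></doc>';
my $base_uri = 'http://example.com/docs/example.xml';

# The second argument tells the parser where the string "came from".
my $doc = $parser->parse_string( $xml, $base_uri );

my $root_name = $doc->documentElement->nodeName;
print "root element: $root_name\n";
print "document URI: ", $doc->URI, "\n";
```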

parse_html_file
    $doc = $parser->parse_html_file( $htmlfile, \%opts );

Similar to parse_file() but parses HTML (strict) documents; $htmlfile
can be a filename or a URL.

An optional second argument can be used to pass some options to the
HTML parser as a HASH reference. Possible options are: encoding and URI
for libxml2 < 2.6.27, and for later versions of libxml2 additionally:
recover, suppress_errors, suppress_warnings, pedantic_parser,
no_blanks, and no_network.

parse_html_fh
    $doc = $parser->parse_html_fh( $io_fh, \%opts );

Similar to parse_fh() but parses HTML (strict) streams.

An optional second argument can be used to pass some options to the
HTML parser as a HASH reference. Possible options are: encoding and URI
for libxml2 < 2.6.27, and for later versions of libxml2 additionally:
recover, suppress_errors, suppress_warnings, pedantic_parser,
no_blanks, and no_network. Note: the encoding option may not work
correctly with this function in libxml2 < 2.6.27 if the HTML file
declares its charset using a META tag.

parse_html_string
    $doc = $parser->parse_html_string( $htmlstring, \%opts );

Similar to parse_string() but parses HTML (strict) strings.

An optional second argument can be used to pass some options to the
HTML parser as a HASH reference. Possible options are: encoding and URI
for libxml2 < 2.6.27, and for later versions of libxml2 additionally:
recover, suppress_errors, suppress_warnings, pedantic_parser,
no_blanks, and no_network.

Parsing HTML may cause problems, especially if the ampersand ('&') is
used. This is a common problem when the HTML being parsed contains
links to CGI scripts. Such links cause the parser to throw errors. In
such cases libxml2 still parses the entire document as if there were no
error, but the error causes XML::LibXML to stop the parsing process.
However, the document is not lost. Such HTML documents should be parsed
using the recover flag. By default recovering is deactivated.
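
A hedged sketch of the situation described above: an HTML string whose link contains a bare ampersand, parsed with the recover and suppress_errors options from the \%opts list. It assumes a libxml2 recent enough to accept those options (2.6.27 or later according to the list above); the HTML content is made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Illustrative input: the query string contains an unescaped '&'.
my $html = '<html><body><p>'
         . '<a href="/cgi-bin/search?q=xml&page=2">next</a>'
         . '</p></body></html>';

# recover asks the HTML parser to keep the tree despite errors;
# suppress_errors keeps the error messages off STDERR.
my $doc = $parser->parse_html_string(
    $html,
    { recover => 1, suppress_errors => 1 },
);

my $link_text = $doc->findvalue('//a');
print "link text: $link_text\n";
```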

The functions described above are implemented to parse well formed
documents. In some cases a program gets well balanced XML instead of
well formed documents (e.g. an XML fragment from a database). With
XML::LibXML it is not necessary to wrap such fragments in your own
code, because XML::LibXML is also capable of parsing well balanced XML
fragments.

parse_balanced_chunk
    $fragment = $parser->parse_balanced_chunk( $wbxmlstring );

This function parses a well balanced XML string into an
XML::LibXML::DocumentFragment.
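
For illustration, a well balanced chunk with several top-level elements (which would not be a well formed document on its own) can be parsed and inspected like this. The chunk content is invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Three sibling elements and no single root: balanced, but not a document.
my $chunk = '<title>Notes</title><para>First</para><para>Second</para>';

my $fragment = $parser->parse_balanced_chunk($chunk);

# The fragment's children are the top-level nodes of the chunk.
my @names = map { $_->nodeName } $fragment->childNodes;
print join( ', ', @names ), "\n";
```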

parse_xml_chunk
    $fragment = $parser->parse_xml_chunk( $wbxmlstring );

This is the old name of parse_balanced_chunk(). Because it may cause
confusion with the push parser interface, this function should not be
used anymore.

By default XML::LibXML does not process XInclude tags within an XML
document (see the options section below). XML::LibXML allows you to
post process a document to expand XInclude tags.

process_xincludes
    $parser->process_xincludes( $doc );

After a document is parsed into a DOM structure, you may want to expand
the document's XInclude tags. This function processes the given
document structure and expands all XInclude tags (or throws an error)
by using the flags and callbacks of the given parser instance.

Note that the resulting tree contains some extra nodes (of type
XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully processing
the document. These nodes indicate where data was included into the
original tree. If the document is serialized, these extra nodes will
not show up.

Remember: A document with processed XIncludes differs from the original
document after serialization, because the original XInclude tags will
not get restored!

If the parser flag "expand_xinclude" is set to 1, you do not need to
post process the parsed document.

processXIncludes
    $parser->processXIncludes( $doc );

This is an alias to process_xincludes, but through a Java-like function
name.

Push Parser

XML::LibXML provides a push parser interface. Rather than pulling the
data from a given source, the push parser waits for the data to be
pushed into it.

This allows one to parse large documents without waiting for the parser
to finish. The interface is especially useful if a program needs to
preprocess the incoming pieces of XML (e.g. to detect document
boundaries).

While the XML::LibXML parse_*() functions require the data to be well
formed XML, the push parser will take any arbitrary string that
contains some XML data. The only requirement is that all the pushed
strings together form a well formed document. With the push parser
interface a program can interrupt the parsing process as required,
where the parse_*() functions do not give enough flexibility.

Unlike the pull parser implemented in parse_fh() or parse_file(), the
push parser is not able to find out about the document's end itself.
Thus the calling program needs to indicate explicitly when the parsing
is done.

In XML::LibXML this is done by a single function:

parse_chunk
    $parser->parse_chunk($string, $terminate);

parse_chunk() tries to parse a given chunk of data, which isn't
necessarily well balanced data. The function takes two parameters: the
chunk of data as a string and, optionally, a termination flag. If the
termination flag is set to a true value (e.g. 1), the parsing will be
stopped and the resulting document will be returned, as the following
example shows:

    my $parser = XML::LibXML->new;
    for my $string ( "<", "foo", ' bar="hello world"', "/>") {
        $parser->parse_chunk( $string );
    }
    my $doc = $parser->parse_chunk("", 1); # terminate the parsing

Internally XML::LibXML provides three functions that control the push
parser process:

start_push
    $parser->start_push();

Initializes the push parser.

push
    $parser->push(@data);

This function pushes the data stored inside the array to libxml2's
parser. Each entry in @data must be a normal scalar!

finish_push
    $doc = $parser->finish_push( $recover );

This function returns the result of the parsing process. If this
function is called without a parameter it will complain about non well
formed documents:

    eval {
        $parser->push( "<foo>", "bar" );
        $doc = $parser->finish_push();    # will report broken XML
    };
    if ( $@ ) {
        # ...
    }

This can be annoying if the closing tag is missed by accident. If
$recover is 1, the push parser can be used to restore broken or non
well formed (XML) documents. The following code will restore the
document:

    eval {
        $parser->push( "<foo>", "bar" );
        $doc = $parser->finish_push(1);   # will return the data parsed
                                          # unless an error happened
    };

    print $doc->toString(); # output includes "<foo>bar</foo>"

Of course finish_push() will return nothing if there was no data pushed
to the parser before.
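
Putting the three functions together, a complete push-parser round trip might look like the sketch below. The pieces are invented for the example and are deliberately split in the middle of a tag to show that individual chunks need not be balanced.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Illustrative data: only the concatenation is a well formed document.
my @pieces = ( '<log>', '<entry id="1"/>', '<entry id', '="2"/>', '</log>' );

$parser->start_push();              # initialize the push parser
$parser->push($_) for @pieces;      # feed the data piece by piece
my $doc = $parser->finish_push();   # no recover flag: dies on broken XML

my @entries = $doc->findnodes('/log/entry');
printf "parsed %d entries\n", scalar @entries;
```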

DOM based SAX Parser

XML::LibXML provides a DOM based SAX parser. The SAX parser is defined
in XML::LibXML::SAX::Parser. As it is not a stream based parser, it
parses documents into a DOM and traverses the DOM tree instead.

The API of this parser is exactly the same as any other Perl SAX2
parser. See XML::SAX::Intro for details.

Aside from the regular parsing methods, you can access the DOM tree
traverser directly, using the generate() method:

    my $doc = build_yourself_a_document();
    my $saxparser = XML::LibXML::SAX::Parser->new( ... );
    $saxparser->generate( $doc );

This is useful for serializing DOM trees, for example trees that you
have already processed, or that you have as a result of XSLT
processing.
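
A minimal sketch of driving a SAX handler from an existing DOM tree with generate(). The ElementCounter class is a hypothetical handler written only for this example; it assumes XML::SAX::Base (from the XML::SAX distribution) is installed and Perl 5.14 or later for the package block syntax.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX::Parser;

# Hypothetical handler for illustration: counts elements by name.
package ElementCounter {
    use parent 'XML::SAX::Base';

    sub start_element {
        my ( $self, $element ) = @_;
        $self->{seen}{ $element->{Name} }++;
    }
}

# Build a small DOM tree first, then replay it as SAX events.
my $doc = XML::LibXML->new()->parse_string('<a><b/><b/><c/></a>');

my $handler   = ElementCounter->new();
my $generator = XML::LibXML::SAX::Parser->new( Handler => $handler );
$generator->generate($doc);

for my $name ( sort keys %{ $handler->{seen} } ) {
    print "$name: $handler->{seen}{$name}\n";
}
```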

WARNING

This is NOT a streaming SAX parser. As I said above, this parser reads
the entire document into a DOM and serialises it. Some people couldn't
read that in the paragraph above so I've added this warning.

If you want a streaming SAX parser look at the XML::LibXML::SAX man
page.

SERIALIZATION

XML::LibXML provides some functions to serialize nodes and documents.
The serialization functions are described on the XML::LibXML::Node
manpage or the XML::LibXML::Document manpage. XML::LibXML checks three
global flags that alter the serialization process:

· skipXMLDeclaration

· skipDTD

· setTagCompression

Of these three flags, only setTagCompression is available for all
serialization functions.

Because XML::LibXML does not set these flags itself, one has to define
them locally as the following example shows:

    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::skipDTD = 1;
    local $XML::LibXML::setTagCompression = 1;

If skipXMLDeclaration is defined and not '0', the XML declaration is
omitted during serialization.

If skipDTD is defined and not '0', an existing DTD will not be
serialized with the document.

If setTagCompression is defined and not '0', empty tags are displayed
as open and closing tags rather than the shortcut. For example the
empty tag foo will be rendered as <foo></foo> rather than <foo/>.
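
The effect of the flags can be seen by serializing the same document twice, once with the defaults and once with two of the flags localized. This is a small sketch; the document content is made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->new()->parse_string('<root><empty/></root>');

# Default serialization: XML declaration present, <empty/> kept short.
my $default = $doc->toString();

# Localize the flags so they only apply inside this block.
my $tweaked = do {
    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::setTagCompression  = 1;
    $doc->toString();
};

print "default:\n$default\n";
print "tweaked:\n$tweaked\n";
```

Using local inside a do block restores the previous flag values as soon as the block ends, so other serialization calls in the program are unaffected.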

PARSER OPTIONS

LibXML options are global (unfortunately this is a limitation of the
underlying implementation, not this interface). They can either be set
using $parser->option(...), or XML::LibXML->option(...); both are
treated in the same manner. Note that even two parser processes will
share some of the same options, so be careful out there!

Every option returns the previous value, and can be called without
parameters to get the current value.

validation
    $parser->validation(1);

Turn validation on (or off). Defaults to off.

recover
    $parser->recover(1);

Turn the parser's recover mode on (or off). Defaults to off.

This allows one to parse broken XML data into memory. This switch will
only work with XML data rather than HTML data. Also the validation will
be switched off automatically.

The recover mode helps to recover documents that are almost well formed
very efficiently. That is, for example, a document that forgets to
close the document tag (or any other tag inside the document). The
recover mode of XML::LibXML has problems restoring documents that are
more like well balanced chunks.

XML::LibXML will only parse until the first fatal error occurs,
reporting recoverable parsing errors as warnings. To suppress these
warnings use $parser->recover_silently(1); or, equivalently,
$parser->recover(2).

recover_silently
    $parser->recover_silently(1);

Turns the parser warnings off (or on). Defaults to on.

This allows one to switch off warnings printed to STDERR when parsing
documents with recover(1).

Please note that calling recover_silently(0) also turns the parser
recover mode off and calling recover_silently(1) automatically
activates the parser recover mode.
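
As a sketch of the recover mode described above, the following parses a document whose root element is never closed. The input is invented for the example; recover_silently(1) is used so that nothing is printed to STDERR.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();

# Also activates recover mode, as noted above.
$parser->recover_silently(1);

# Illustrative broken input: the closing </doc> tag is missing.
my $doc = $parser->parse_string('<doc><item>text</item>');

my $item = $doc->findvalue('/doc/item');
print "recovered item: $item\n";
```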

expand_entities
    $parser->expand_entities(0);

Turn entity expansion on or off, enabled by default. If entity
expansion is off, any external parsed entities in the document are left
as entities. Probably not very useful for most purposes.

keep_blanks
    $parser->keep_blanks(0);

Allows you to turn off XML::LibXML's default behaviour of maintaining
whitespace in the document.
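
To illustrate what keep_blanks changes, the sketch below parses the same indented input with two parser instances and counts the children of the root element. The input string is made up, and both parsers set the option explicitly so the comparison does not depend on the default.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative input: two child elements separated by indentation.
my $xml = "<root>\n  <a/>\n  <b/>\n</root>";

my $keeping = XML::LibXML->new();
$keeping->keep_blanks(1);
my $doc_with = $keeping->parse_string($xml);

my $stripping = XML::LibXML->new();
$stripping->keep_blanks(0);
my $doc_without = $stripping->parse_string($xml);

# With blanks kept, whitespace-only text nodes sit between the elements.
my @with_blanks    = $doc_with->documentElement->childNodes;
my @without_blanks = $doc_without->documentElement->childNodes;

printf "keep_blanks(1): %d child nodes\n", scalar @with_blanks;
printf "keep_blanks(0): %d child nodes\n", scalar @without_blanks;
```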

pedantic_parser
    $parser->pedantic_parser(1);

You can make XML::LibXML more pedantic if you want to.

line_numbers
    $parser->line_numbers(1);

If this option is activated XML::LibXML will store the line number of a
node. This gives more information about where a validation error
occurred. It can also be used to find out about the position of a node
after parsing (see also XML::LibXML::Node::line_number()).

By default line numbering is switched off (0).
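
A short sketch of line numbering in use: the option is switched on before parsing, and line_number() is then called on the resulting nodes. The input string is invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->line_numbers(1);    # must be set before the document is parsed

# Illustrative input with one element per line.
my $doc = $parser->parse_string("<root>\n  <first/>\n  <second/>\n</root>\n");

# Map each child element's name to the line it started on.
my %line_of = map { $_->nodeName => $_->line_number }
              $doc->findnodes('/root/*');

for my $name ( sort { $line_of{$a} <=> $line_of{$b} } keys %line_of ) {
    print "$name starts on line $line_of{$name}\n";
}
```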

load_ext_dtd
    $parser->load_ext_dtd(1);

Load external DTD subsets while parsing.

This flag is also required for DTD validation, to provide complete
attributes, and to expand entities, regardless of whether the document
has an internal subset. Thus switching off external DTD loading will
disable entity expansion, validation, and complete attributes on
internal subsets as well.

If you leave this parser flag untouched, everything will work, because
the default is 1 (activated).

complete_attributes
    $parser->complete_attributes(1);

Complete the elements' attribute lists with the ones defaulted from the
DTDs. By default, this option is enabled.

expand_xinclude
    $parser->expand_xinclude(1);

Expands XInclude tags immediately while parsing the document. This flag
ensures that the parser callbacks are used while parsing the included
document.

load_catalog
    $parser->load_catalog( $catalog_file );

Will use $catalog_file as a catalog during all parsing processes. Using
a catalog will significantly speed up parsing processes if many
external resources are loaded into the parsed documents (such as DTDs
or XIncludes).

Note that catalogs will not be available if an external entity handler
was specified. At the current state it is not possible to make use of
both types of resolving systems at the same time.

base_uri
    $parser->base_uri( $your_base_uri );

When parsing strings or file handles, XML::LibXML doesn't know about
the base URI of the document. To make relative references such as
XIncludes work, one has to set a separate base URI, which is then used
for the parsed documents.

gdome_dom
    $parser->gdome_dom(1);

THIS FLAG IS EXPERIMENTAL!

Although quite powerful, XML::LibXML's DOM implementation is limited if
one needs or wants full DOM level 2 or level 3 support. XML::GDOME is
based on libxml2 as well but provides a rather complete DOM
implementation by wrapping libgdome. This allows you to make use of
XML::LibXML's full parser options and XML::GDOME's DOM implementation
at the same time.

To make use of this function, one has to install libgdome and configure
XML::LibXML to use this library. For this you need to rebuild
XML::LibXML!

clean_namespaces
    $parser->clean_namespaces( 1 );

libxml2 2.6.0 and later allows one to strip redundant namespace
declarations from the DOM tree. To do this, one has to set
clean_namespaces() to 1 (TRUE). By default no namespace cleanup is
done.

ERROR REPORTING

XML::LibXML throws exceptions during parsing, validation or XPath
processing (and on some other occasions). These errors can be caught by
using eval blocks. The error will then be stored in $@.

XML::LibXML throws errors as they occur and does not wait for the user
to test for them. This is a very common misunderstanding in the use of
XML::LibXML. If the eval is omitted, XML::LibXML will always halt your
script by "croaking" (see the Carp man page for details).

Also note that an increasing number of functions throw errors if bad
data is passed. If you cannot ensure that valid data is passed to
XML::LibXML you should eval these functions.

Note: since version 1.59, get_last_error() is no longer available in
XML::LibXML for thread-safety reasons.

AUTHORS

Matt Sergeant, Christian Glahn, Petr Pajas

VERSION

1.62

COPYRIGHT

2001-2006, AxKit.com Ltd; 2002-2006 Christian Glahn; 2006 Petr Pajas,
All rights reserved.

perl v5.8.8    2006-11-17    XML::LibXML::Parser(3)