XML::LibXML::Reader(3pm)

XML::LibXML::Reader(3)  User Contributed Perl Documentation  XML::LibXML::Reader(3)

NAME

       XML::LibXML::Reader - interface to libxml2 pull parser

SYNOPSIS

         use XML::LibXML::Reader;

         # walk the document node by node
         my $reader = XML::LibXML::Reader->new(location => "file.xml")
                or die "cannot read file.xml\n";
         while ($reader->read) {
           processNode($reader);
         }

         # print depth, node type, name and empty-element flag
         sub processNode {
             my $reader = shift;
             printf "%d %d %s %d\n", ($reader->depth,
                                      $reader->nodeType,
                                      $reader->name,
                                      $reader->isEmptyElement);
         }

       or

           # keep only the matching nodes and print them as a DOM tree
           my $reader = XML::LibXML::Reader->new(location => "file.xml")
                or die "cannot read file.xml\n";
           $reader->preservePattern('//table/tr');
           $reader->finish;
           print $reader->document->toString(1);

DESCRIPTION

       This is a Perl interface to libxml2's pull-parser implementation
       xmlTextReader http://xmlsoft.org/html/libxml-xmlreader.html. Pull
       parsers (StAX in Java, XmlReader in C#) use an iterator approach to
       parse an XML file. They are easier to program than event-based parsers
       (SAX) and much more lightweight than tree-based parsers (DOM), which
       load the complete tree into memory.

       The Reader acts as a cursor going forward on the document stream and
       stopping at each node on the way. At every point, DOM-like methods of
       the Reader object allow the current node to be examined (name,
       namespace, attributes, etc.).

       The user's code keeps control of the progress and simply calls the
       read() function repeatedly to progress to the next node in document
       order. Other functions provide means for skipping complete subtrees,
       or for skipping nodes until a specific element is reached, etc.

       At any time, only a very limited portion of the document is kept in
       memory, which makes the API more memory-efficient than using DOM.
       However, it is also possible to mix Reader with DOM. At every point the
       user may copy the current node (optionally expanded into a complete
       subtree) from the processed document to another DOM tree, or instruct
       the Reader to collect a sub-document in the form of a DOM tree
       consisting of selected nodes.

       The Reader API also supports namespaces, xml:base, entity handling, and
       DTD validation. Schema and RelaxNG validation support will probably be
       added in some later revision of the Perl interface.

       The naming of methods compared to libxml2 and C# XmlTextReader has been
       changed slightly to match the conventions of XML::LibXML. Some
       functions have been changed or added with respect to the C interface.

CONSTRUCTOR

       Depending on the XML source, the Reader object can be created with
       one of:

           my $reader = XML::LibXML::Reader->new( location => "file.xml", ... );
           my $reader = XML::LibXML::Reader->new( string => $xml_string, ... );
           my $reader = XML::LibXML::Reader->new( IO => $file_handle, ... );
           my $reader = XML::LibXML::Reader->new( DOM => $dom, ... );

       where ... are (optional) reader options described below in Parsing
       options. The constructor recognizes the following XML sources:

       Source specification

       location
           Read XML from a local file or URL.

       string
           Read XML from a string.

       IO  Read XML from a Perl IO filehandle.

       FD  Read XML from a file descriptor (bypasses the Perl I/O layer, only
           applicable to filehandles for regular files or pipes). Possibly
           faster than IO.

       DOM Use the reader API to walk through a preparsed
           XML::LibXML::Document.

       Parsing options

       URI can be used to provide baseURI when parsing strings or filehandles.

       encoding
           override the document encoding.

       RelaxNG
           can be used to pass either an XML::LibXML::RelaxNG object or a
           filename or URL of a RelaxNG schema to the constructor. The schema
           is then used to validate the document as it is processed.

       Schema
           can be used to pass either an XML::LibXML::Schema object or a
           filename or URL of a W3C XSD schema to the constructor. The schema
           is then used to validate the document as it is processed.

       recover
           recover on errors (0 or 1)

       expand_entities
           substitute entities (0 or 1)

       load_ext_dtd
           load the external subset (0 or 1)

       complete_attributes
           default DTD attributes (0 or 1)

       validation
           validate with the DTD (0 or 1)

       suppress_errors
           suppress error reports (0 or 1)

       suppress_warnings
           suppress warning reports (0 or 1)

       pedantic_parser
           pedantic error reporting (0 or 1)

       no_blanks
           remove blank nodes (0 or 1)

       expand_xinclude
           implement XInclude substitution (0 or 1)

       no_network
           forbid network access (0 or 1)

       clean_namespaces
           remove redundant namespace declarations (0 or 1)

       no_cdata
           merge CDATA as text nodes (0 or 1)

       no_xinclude_nodes
           do not generate XINCLUDE START/END nodes (0 or 1)
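
       As a rough illustration of combining a source specification with
       parsing options, the sketch below reads from a string. The sample
       document and the variable names are assumptions made for this example;
       only the source and option names are taken from the lists above.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample input; any well-formed XML string would do.
my $xml = '<list><item>one</item><item>two</item></list>';

# One source specification (string) plus two of the parsing options.
my $reader = XML::LibXML::Reader->new(
    string     => $xml,
    no_network => 1,    # forbid network access
    no_cdata   => 1,    # report CDATA sections as text nodes
) or die "cannot create reader\n";

# Count the start tags seen while walking the document.
my $elements = 0;
while ($reader->read == 1) {
    $elements++ if $reader->nodeType == XML_READER_TYPE_ELEMENT;
}
print "start tags seen: $elements\n";
```

       Passing the options to the constructor, rather than changing them
       later, avoids the restriction that some properties can only be set
       before the first read.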

METHODS CONTROLLING PARSING PROGRESS

       read ()
           Moves the position to the next node in the stream, exposing its
           properties.

           Returns 1 if the node was read successfully, 0 if there are no more
           nodes to read, or -1 in case of error.

       readAttributeValue ()
           Parses an attribute value into one or more Text and EntityReference
           nodes.

           Returns 1 in case of success, 0 if the reader was not positioned on
           an attribute node or all the attribute values have been read, or -1
           in case of error.

       readState ()
           Gets the read state of the reader. Returns the state value, or -1
           in case of error. The module exports constants for the Reader
           states, see STATES below.

       depth ()
           Returns the depth of the current node in the tree, starting at 0
           for the root node.

       next ()
           Skips to the node following the current one in document order,
           bypassing the current node's subtree, if any. Returns 1 if the node
           was read successfully, 0 if there are no more nodes to read, or -1
           in case of error.

       nextElement (localname?,nsURI?)
           Skips nodes following the current one in document order until a
           specific element is reached. The element's name must be equal to
           the given localname if defined, and its namespace must be equal to
           the given nsURI if defined. Either of the arguments can be
           undefined (or omitted, in case of the latter or both).

           Returns 1 if the element was found, 0 if there are no more nodes to
           read, or -1 in case of error.

       skipSiblings ()
           Skips all nodes on the same or lower level until the first node on
           a higher level is reached. In particular, if the current node
           occurs in an element, the reader stops at the end tag of the parent
           element, otherwise it stops at a node immediately following the
           parent node.

           Returns 1 if successful, 0 if the end of the document is reached,
           or -1 in case of error.

       nextSibling ()
           Skips to the node following the current one in document order,
           bypassing the current node's subtree, if any.

           Returns 1 if the node was read successfully, 0 if there are no more
           nodes to read, or -1 in case of error.

       nextSiblingElement (name?,nsURI?)
           Like nextElement, but only processes sibling elements of the
           current node (internally moving forward using nextSibling ()
           rather than read ()).

           Returns 1 if the element was found, 0 if there are no more sibling
           nodes, or -1 in case of error.

       finish ()
           Skips all remaining nodes in the document, reaching the end of the
           document.

           Returns 1 if successful, 0 in case of error.

       close ()
           This method releases any resources allocated by the current
           instance and closes any underlying input. It returns 0 on failure
           and 1 on success. This method is automatically called by the
           destructor when the reader is forgotten, therefore you do not have
           to call it directly.
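
       The navigation methods above can be combined in several ways. The
       following is a minimal sketch, not a definitive recipe, of two of
       them: bypassing a subtree with next() and jumping between named
       elements with nextElement(). The sample document, the element names
       and the helper subs names_outside_skip and item_ids are all
       hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, invented for this sketch.
my $xml = '<root><skip><item id="hidden"/></skip>'
        . '<item id="a"/><other/><item id="b">text</item></root>';

# Collect element names, using next() to jump over <skip> subtrees.
sub names_outside_skip {
    my ($source) = @_;
    my $reader = XML::LibXML::Reader->new(string => $source)
        or die "cannot create reader\n";
    my @names;
    my $status = $reader->read;
    while ($status == 1) {
        if ($reader->nodeType == XML_READER_TYPE_ELEMENT) {
            if ($reader->name eq 'skip') {
                # next() has already moved past the subtree, so
                # read() must not be called again in this iteration.
                $status = $reader->next;
                next;
            }
            push @names, $reader->name;
        }
        $status = $reader->read;
    }
    return @names;
}

# Collect the id attribute of every <item>, using nextElement().
sub item_ids {
    my ($source) = @_;
    my $reader = XML::LibXML::Reader->new(string => $source)
        or die "cannot create reader\n";
    my @ids;
    while ($reader->nextElement('item') == 1) {
        push @ids, $reader->getAttribute('id');
    }
    return @ids;
}

print join(' ', names_outside_skip($xml)), "\n";
print join(' ', item_ids($xml)), "\n";
```

       Note that nextElement() visits every matching element in document
       order, including the one nested inside the skip element, whereas the
       first helper never descends into that subtree.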

METHODS EXTRACTING INFORMATION

       name ()
           Returns the qualified name of the current node, equal to
           (Prefix:)LocalName.

       nodeType ()
           Returns the type of the current node. See NODE TYPES below.

       localName ()
           Returns the local name of the node.

       prefix ()
           Returns the prefix of the namespace associated with the node.

       namespaceURI ()
           Returns the URI defining the namespace associated with the node.

       isEmptyElement ()
           Checks whether the current node is empty. This is a bit bizarre in
           the sense that <a/> will be considered empty while <a></a> will
           not.

       hasValue ()
           Returns true if the node can have a text value.

       value ()
           Provides the text value of the node if present, or undef if not
           available.

       readInnerXml ()
           Reads the contents of the current node, including child nodes and
           markup.

           Returns a string containing the XML of the node's content, or undef
           if the current node is neither an element nor an attribute, or has
           no child nodes.

       readOuterXml ()
           Reads the contents of the current node, including child nodes and
           markup.

           Returns a string containing the XML of the node including its
           content, or undef if the current node is neither an element nor an
           attribute.
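
       To make the difference between readInnerXml() and readOuterXml()
       concrete, here is a small sketch. The input string and the variable
       names are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample input for this sketch.
my $reader = XML::LibXML::Reader->new(string => '<a><b>x</b><c/></a>')
    or die "cannot create reader\n";

# Position the reader on the <a> start tag.
$reader->nextElement('a') == 1 or die "no <a> element found\n";

my $inner = $reader->readInnerXml;   # markup of the children only
my $outer = $reader->readOuterXml;   # markup of the node and its children

print "inner: $inner\n";
print "outer: $outer\n";
```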

METHODS EXTRACTING DOM NODES

       document ()
           Provides access to the document tree built by the reader. This
           function can be used to collect the preserved nodes (see
           preserveNode() and preservePattern).

           CAUTION: Never use this function to modify the tree unless reading
           of the whole document is completed!

       copyCurrentNode (deep)
           This function is similar to the DOM function copyNode(). It
           returns a copy of the currently processed node as a corresponding
           DOM object. Use deep = 1 to obtain the full subtree.

       preserveNode ()
           This tells the XML Reader to preserve the current node in the
           document tree. A document tree consisting of the preserved nodes
           and their content can be obtained using the method document() once
           parsing is finished.

           Returns the node or NULL in case of error.

       preservePattern (pattern,\%ns_map)
           This tells the XML Reader to preserve all nodes matched by the
           pattern (which is a streaming XPath subset). A document tree
           consisting of the preserved nodes and their content can be obtained
           using the method document() once parsing is finished.

           An optional second argument can be used to provide a HASH reference
           mapping prefixes used by the XPath to namespace URIs.

           The XPath subset available with this function is described at

             http://www.w3.org/TR/xmlschema-1/#Selector

           and matches the production

             Path ::= ('.//')? ( Step '/' )* ( Step | '@' NameTest )

           Returns a positive number in case of success and -1 in case of
           error.
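
       One way of mixing the Reader with DOM is to stream through a document
       and turn only the interesting elements into DOM nodes. The sketch
       below does this with copyCurrentNode(1); the sample document and the
       variable names are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample input for this sketch.
my $xml = '<list><item id="1">a</item><item id="2">b</item></list>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

# Copy every <item> (with its full subtree) into a standalone DOM node.
my @copies;
while ($reader->nextElement('item') == 1) {
    my $node = $reader->copyCurrentNode(1);   # deep copy
    push @copies, $node->toString;
}

print "$_\n" for @copies;
```

       Because each copy is an ordinary DOM node, it can be inspected with the
       usual XML::LibXML node methods while the rest of the document is still
       being streamed.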

METHODS PROCESSING ATTRIBUTES

       attributeCount ()
           Provides the number of attributes of the current node.

       hasAttributes ()
           Whether the node has attributes.

       getAttribute (name)
           Provides the value of the attribute with the specified qualified
           name.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       getAttributeNs (localName, namespaceURI)
           Provides the value of the attribute with the specified local name
           and namespace URI.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       getAttributeNo (no)
           Provides the value of the attribute with the specified index
           relative to the containing element.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       isDefault ()
           Returns true if the current attribute node was generated from the
           default value defined in the DTD.

       moveToAttribute (name)
           Moves the position to the attribute with the specified qualified
           name.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToAttributeNo (no)
           Moves the position to the attribute with the specified index
           relative to the containing element.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToAttributeNs (localName,namespaceURI)
           Moves the position to the attribute with the specified local name
           and namespace URI.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToFirstAttribute ()
           Moves the position to the first attribute associated with the
           current node.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToNextAttribute ()
           Moves the position to the next attribute associated with the
           current node.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToElement ()
           Moves the position to the node that contains the current attribute
           node.

           Returns 1 in case of success, -1 in case of error, 0 if not moved.

       isNamespaceDecl ()
           Determines whether the current node is a namespace declaration
           rather than a regular attribute.

           Returns 1 if the current node is a namespace declaration, 0 if it
           is a regular attribute or other type of node, or -1 in case of
           error.
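
       The moveTo* methods can be used to iterate over all attributes of an
       element and then return to the element itself. A minimal sketch,
       assuming a made-up input string and hypothetical variable names:

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample input for this sketch.
my $reader = XML::LibXML::Reader->new(string => '<img src="a.png" alt="logo"/>')
    or die "cannot create reader\n";

# Position the reader on the <img> start tag.
$reader->nextElement('img') == 1 or die "no <img> element found\n";

# Visit each attribute in turn, then move back to the owning element.
my %attr;
if ($reader->moveToFirstAttribute == 1) {
    do {
        $attr{ $reader->name } = $reader->value;
    } while ($reader->moveToNextAttribute == 1);
    $reader->moveToElement;
}

print "$_ = $attr{$_}\n" for sort keys %attr;
```

       When only a single, known attribute is needed, getAttribute() is
       simpler because it does not move the reader's position.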

OTHER METHODS

       lookupNamespace (prefix)
           Resolves a namespace prefix in the scope of the current element.

           Returns a string containing the namespace URI to which the prefix
           maps, or undef in case of error.

       encoding ()
           Returns a string containing the encoding of the document, or undef
           in case of error.

       standalone ()
           Determines the standalone status of the document being read.
           Returns 1 if the document was declared to be standalone, 0 if it
           was declared to be not standalone, or -1 if the document did not
           specify its standalone status or in case of error.

       xmlVersion ()
           Determines the XML version of the document being read. Returns a
           string containing the XML version of the document, or undef in case
           of error.

       baseURI ()
           The base URI of the node. See the XML Base W3C specification.

       isValid ()
           Retrieves the validity status from the parser.

           Returns 1 if valid, 0 if not, and -1 in case of error.

       xmlLang ()
           The xml:lang scope within which the node resides.

       lineNumber ()
           Provides the line number of the current parsing point. Available
           if libxml2 >= 2.6.17.

       columnNumber ()
           Provides the column number of the current parsing point. Available
           if libxml2 >= 2.6.17.

       byteConsumed ()
           This function provides the current index of the parser relative to
           the start of the current entity. The index is counted in bytes,
           starting at zero and ending at the size of the file in bytes when
           parsing a file. The function is of constant cost if the input is
           UTF-8 but can be costly if run on non-UTF-8 input. Available if
           libxml2 >= 2.6.18.

       setParserProp (prop => value, ...)
           Changes the parser processing behaviour by changing some of its
           internal properties. The following properties are available with
           this function: "load_ext_dtd", "complete_attributes",
           "validation", "expand_entities".

           Since some of the properties can only be changed before any read
           has been done, it is best to set the parsing properties in the
           constructor.

           Returns 0 if the call was successful, or -1 in case of error.

       getParserProp (prop)
           Gets the value of an internal parser property. The following
           property names can be used: "load_ext_dtd", "complete_attributes",
           "validation", "expand_entities".

           Returns the value, usually 0 or 1, or -1 in case of error.

DESTRUCTION

       XML::LibXML takes care of the reader object destruction when the last
       reference to the reader object goes out of scope. The document tree is
       preserved, though, if either of $reader->document or
       $reader->preserveNode was used and references to the document tree
       exist.

NODE TYPES

       The reader interface provides the following constants for node types
       (the constant symbols are exported by default or if tag :types is
       used).

           XML_READER_TYPE_NONE                    => 0
           XML_READER_TYPE_ELEMENT                 => 1
           XML_READER_TYPE_ATTRIBUTE               => 2
           XML_READER_TYPE_TEXT                    => 3
           XML_READER_TYPE_CDATA                   => 4
           XML_READER_TYPE_ENTITY_REFERENCE        => 5
           XML_READER_TYPE_ENTITY                  => 6
           XML_READER_TYPE_PROCESSING_INSTRUCTION  => 7
           XML_READER_TYPE_COMMENT                 => 8
           XML_READER_TYPE_DOCUMENT                => 9
           XML_READER_TYPE_DOCUMENT_TYPE           => 10
           XML_READER_TYPE_DOCUMENT_FRAGMENT       => 11
           XML_READER_TYPE_NOTATION                => 12
           XML_READER_TYPE_WHITESPACE              => 13
           XML_READER_TYPE_SIGNIFICANT_WHITESPACE  => 14
           XML_READER_TYPE_END_ELEMENT             => 15
           XML_READER_TYPE_END_ENTITY              => 16
           XML_READER_TYPE_XML_DECLARATION         => 17
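
       These constants are typically compared against nodeType() inside a
       read() loop. The following sketch, with an invented sample document and
       hypothetical variable names, records a simple event list for start
       tags, end tags and text nodes.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample input for this sketch.
my $reader = XML::LibXML::Reader->new(
    string => '<doc><p>Hello</p><p>World</p></doc>',
) or die "cannot create reader\n";

my @events;
my $status;
while (($status = $reader->read) == 1) {
    my $type = $reader->nodeType;
    if ($type == XML_READER_TYPE_ELEMENT) {
        push @events, 'start:' . $reader->name;
    }
    elsif ($type == XML_READER_TYPE_END_ELEMENT) {
        push @events, 'end:' . $reader->name;
    }
    elsif ($type == XML_READER_TYPE_TEXT) {
        push @events, 'text:' . $reader->value;
    }
}
# read() returns -1 on error and 0 at the end of the document.
die "error while reading\n" if $status < 0;

print "$_\n" for @events;
```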

STATES

       The following constants represent the values returned by readState().
       They are exported by default, or if tag :states is used:

           XML_READER_NONE      => -1
           XML_READER_START     =>  0
           XML_READER_ELEMENT   =>  1
           XML_READER_END       =>  2
           XML_READER_EMPTY     =>  3
           XML_READER_BACKTRACK =>  4
           XML_READER_DONE      =>  5
           XML_READER_ERROR     =>  6

VERSION

       0.02

AUTHORS

       Heiko Klein, <H.Klein@gmx.net> and Petr Pajas, <pajas@matfyz.cz>

AUTHORS

       Matt Sergeant, Christian Glahn, Petr Pajas

VERSION

       1.62

COPYRIGHT

       2001-2006, AxKit.com Ltd; 2002-2006 Christian Glahn; 2006 Petr Pajas,
       All rights reserved.

perl v5.8.8                       2006-11-17            XML::LibXML::Reader(3)