XML::LibXML::Reader(3pm)

XML::LibXML::Reader(3)    User Contributed Perl Documentation    XML::LibXML::Reader(3)

NAME

       XML::LibXML::Reader - interface to libxml2 pull parser

SYNOPSIS

         use XML::LibXML::Reader;

         my $reader = XML::LibXML::Reader->new(location => "file.xml")
                or die "cannot read file.xml\n";
         while ($reader->read) {
           processNode($reader);
         }

         sub processNode {
             my $reader = shift;
             printf "%d %d %s %d\n", ($reader->depth,
                                      $reader->nodeType,
                                      $reader->name,
                                      $reader->isEmptyElement);
         }

       or

         my $reader = XML::LibXML::Reader->new(location => "file.xml")
                or die "cannot read file.xml\n";
         $reader->preservePattern('//table/tr');
         $reader->finish;
         print $reader->document->toString(1);

DESCRIPTION

       This is a Perl interface to libxml2's pull-parser implementation
       xmlTextReader http://xmlsoft.org/html/libxml-xmlreader.html. This
       feature requires at least libxml2-2.6.21. Pull parsers (such as StAX in
       Java, or XmlReader in C#) use an iterator approach to parse XML
       documents. They are easier to program than event-based parsers (SAX)
       and much more lightweight than tree-based parsers (DOM), which load
       the complete tree into memory.

       The Reader acts as a cursor going forward on the document stream and
       stopping at each node on the way. At every point, DOM-like methods of
       the Reader object allow you to examine the current node (name,
       namespace, attributes, etc.).

       The user's code keeps control of the progress and simply calls the
       "read()" function repeatedly to progress to the next node in document
       order. Other functions provide means for skipping complete sub-trees,
       or for skipping nodes until a specific element is reached, etc.

       At any time, only a very limited portion of the document is kept in
       memory, which makes the API more memory-efficient than using DOM.
       However, it is also possible to mix Reader with DOM. At every point the
       user may copy the current node (optionally expanded into a complete
       sub-tree) from the processed document to another DOM tree, or instruct
       the Reader to collect a sub-document in the form of a DOM tree
       consisting of selected nodes.

       The Reader API also supports namespaces, xml:base, entity handling, and
       DTD validation. RelaxNG and W3C XML Schema validation are available
       through the "RelaxNG" and "Schema" constructor options described under
       "Reader options".

       The naming of methods compared to libxml2 and C# XmlTextReader has been
       changed slightly to match the conventions of XML::LibXML. Some
       functions have been changed or added with respect to the C interface.

CONSTRUCTOR

       Depending on the XML source, the Reader object can be created with
       one of:

         my $reader = XML::LibXML::Reader->new( location => "file.xml", ... );
         my $reader = XML::LibXML::Reader->new( string => $xml_string, ... );
         my $reader = XML::LibXML::Reader->new( IO => $file_handle, ... );
         my $reader = XML::LibXML::Reader->new( FD => fileno(STDIN), ... );
         my $reader = XML::LibXML::Reader->new( DOM => $dom, ... );

       where ... are (optional) reader options described below in "Reader
       options" or various parser options described in XML::LibXML::Parser.
       The constructor recognizes the following XML sources:

   Source specification
       location
           Read XML from a local file or URL.

       string
           Read XML from a string.

       IO  Read XML from a Perl IO filehandle.

       FD  Read XML from a file descriptor (bypasses the Perl I/O layer; only
           applicable to filehandles for regular files or pipes). Possibly
           faster than IO.

       DOM Use the reader API to walk through a pre-parsed
           XML::LibXML::Document.

   Reader options
       encoding => $encoding
           Override the document encoding.

       RelaxNG => $rng_schema
           Can be used to pass either an XML::LibXML::RelaxNG object or a
           filename or URL of a RelaxNG schema to the constructor. The schema
           is then used to validate the document as it is processed.

       Schema => $xsd_schema
           Can be used to pass either an XML::LibXML::Schema object or a
           filename or URL of a W3C XSD schema to the constructor. The schema
           is then used to validate the document as it is processed.

       ... the reader further supports various parser options described in
           XML::LibXML::Parser (specifically those labeled by /reader/).

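       As a minimal sketch of the "string" source, the following reads a
       small in-memory document and collects the names of its start tags.
       The sample XML and the variable names are assumptions made up for
       illustration; the other sources differ only in the constructor
       argument.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample document, used only for illustration.
my $xml = '<root><item id="1">one</item><item id="2">two</item></root>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @names;
while ( $reader->read ) {
    # Keep start tags only; text and end tags are skipped.
    push @names, $reader->name
        if $reader->nodeType == XML_READER_TYPE_ELEMENT;
}
print join( ',', @names ), "\n";
```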

METHODS CONTROLLING PARSING PROGRESS

       read ()
           Moves the position to the next node in the stream, exposing its
           properties.

           Returns 1 if the node was read successfully, 0 if there are no more
           nodes to read, or -1 in case of error.

       readAttributeValue ()
           Parses an attribute value into one or more Text and EntityReference
           nodes.

           Returns 1 in case of success, 0 if the reader was not positioned on
           an attribute node or all the attribute values have been read, or -1
           in case of error.

       readState ()
           Gets the read state of the reader. Returns the state value, or -1
           in case of error. The module exports constants for the Reader
           states, see STATES below.

       depth ()
           The depth of the node in the tree, starting at 0 for the root node.

       next ()
           Skips to the node following the current one in document order,
           bypassing the sub-tree of the current node, if any. Returns 1 if
           the node was read successfully, 0 if there are no more nodes to
           read, or -1 in case of error.

       nextElement (localname?,nsURI?)
           Skips nodes following the current one in document order until a
           specific element is reached. The element's name must be equal to
           the given localname if defined, and its namespace must be equal to
           the given nsURI if defined. Either of the arguments can be
           undefined (or omitted, in case of the latter or both).

           Returns 1 if the element was found, 0 if there are no more nodes to
           read, or -1 in case of error.

       nextPatternMatch (compiled_pattern)
           Skips nodes following the current one in document order until an
           element matching the given compiled pattern is reached. See
           XML::LibXML::Pattern for information on compiled patterns. See also
           the "matchesPattern" method.

           Returns 1 if the element was found, 0 if there are no more nodes to
           read, or -1 in case of error.

       skipSiblings ()
           Skips all nodes on the same or lower level until the first node on
           a higher level is reached. In particular, if the current node
           occurs in an element, the reader stops at the end tag of the parent
           element; otherwise it stops at a node immediately following the
           parent node.

           Returns 1 if successful, 0 if the end of the document is reached,
           or -1 in case of error.

       nextSibling ()
           Skips to the node following the current one in document order,
           bypassing the sub-tree of the current node, if any.

           Returns 1 if the node was read successfully, 0 if there are no more
           nodes to read, or -1 in case of error.

       nextSiblingElement (name?,nsURI?)
           Like nextElement but only processes sibling elements of the current
           node (moving forward using "nextSibling ()" rather than "read ()",
           internally).

           Returns 1 if the element was found, 0 if there are no more sibling
           nodes, or -1 in case of error.

       finish ()
           Skips all remaining nodes in the document, reaching the end of the
           document.

           Returns 1 if successful, 0 in case of error.

       close ()
           This method releases any resources allocated by the current
           instance and closes any underlying input. It returns 0 on failure
           and 1 on success. This method is automatically called by the
           destructor when the reader is forgotten, therefore you do not have
           to call it directly.

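       The skipping methods above can be sketched as follows: instead of
       visiting every node with "read()", the loop jumps straight from one
       "row" element to the next with "nextElement()". The document and the
       element names are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical document: only the <row> elements are of interest.
my $xml = '<doc><skip><deep/></skip><row>a</row><row>b</row></doc>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @seen;
# nextElement returns 1 while a matching element is found, 0 at the end.
while ( $reader->nextElement('row') == 1 ) {
    push @seen, $reader->name . ' at depth ' . $reader->depth;
}
print "$_\n" for @seen;
```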

METHODS EXTRACTING INFORMATION

       name ()
           Returns the qualified name of the current node, equal to
           (Prefix:)LocalName.

       nodeType ()
           Returns the type of the current node. See NODE TYPES below.

       localName ()
           Returns the local name of the node.

       prefix ()
           Returns the prefix of the namespace associated with the node.

       namespaceURI ()
           Returns the URI defining the namespace associated with the node.

       isEmptyElement ()
           Checks whether the current node is empty. This is a bit bizarre in
           the sense that <a/> is considered empty while <a></a> is not.

       hasValue ()
           Returns true if the node can have a text value.

       value ()
           Provides the text value of the node if present, or undef if not
           available.

       readInnerXml ()
           Reads the contents of the current node, including child nodes and
           markup. Returns a string containing the XML of the node's content,
           or undef if the current node is neither an element nor an
           attribute, or has no child nodes.

       readOuterXml ()
           Reads the contents of the current node, including child nodes and
           markup.

           Returns a string containing the XML of the node including its
           content, or undef if the current node is neither an element nor an
           attribute.

       nodePath()
           Returns a canonical location path from the root node to the current
           node. Namespaced elements are matched by '*', because there is no
           way to declare prefixes within XPath patterns. Unlike
           "XML::LibXML::Node::nodePath()", this function does not provide
           sibling counts (i.e. instead of e.g. '/a/b[1]' and '/a/b[2]' you
           get '/a/b' for both matches).

       matchesPattern(compiled_pattern)
           Returns a true value if the current node matches a compiled
           pattern. See XML::LibXML::Pattern for information on compiled
           patterns. See also the "nextPatternMatch" method.

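       A short sketch of the information methods, run on a made-up document:
       it records name, depth and the "isEmptyElement()" flag for every start
       tag and fetches the inner XML of one element. Note how only <c/> is
       reported as empty.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical document for illustration.
my $xml = '<a><b>text<i>x</i></b><c/></a>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my ( @report, $inner );
while ( $reader->read ) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    push @report, sprintf( '%s depth=%d empty=%d',
        $reader->name, $reader->depth, $reader->isEmptyElement );
    # Serialize the children of <b> without leaving the current position.
    $inner = $reader->readInnerXml if $reader->name eq 'b';
}
print "$_\n" for @report;
print "inner XML of <b>: $inner\n";
```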

METHODS EXTRACTING DOM NODES

       document ()
           Provides access to the document tree built by the reader. This
           function can be used to collect the preserved nodes (see
           "preserveNode()" and "preservePattern()").

           CAUTION: Never use this function to modify the tree unless reading
           of the whole document is completed!

       copyCurrentNode (deep)
           This function is similar to the DOM function "copyNode()". It
           returns a copy of the currently processed node as a corresponding
           DOM object. Use deep = 1 to obtain the full sub-tree.

       preserveNode ()
           This tells the XML Reader to preserve the current node in the
           document tree. A document tree consisting of the preserved nodes
           and their content can be obtained using the method "document()"
           once parsing is finished.

           Returns the node, or undef in case of error.

       preservePattern (pattern,\%ns_map)
           This tells the XML Reader to preserve all nodes matched by the
           pattern (which is a streaming XPath subset). A document tree
           consisting of the preserved nodes and their content can be obtained
           using the method "document()" once parsing is finished.

           An optional second argument can be used to provide a HASH reference
           mapping prefixes used by the XPath to namespace URIs.

           The XPath subset available with this function is described at

             http://www.w3.org/TR/xmlschema-1/#Selector

           and matches the production

             Path ::= ('.//')? ( Step '/' )* ( Step | '@' NameTest )

           Returns a positive number in case of success and -1 in case of
           error.

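       One way to mix the Reader with DOM, sketched under the assumption of a
       flat list of "item" records (the document and names are
       hypothetical): each record is expanded into a small DOM sub-tree with
       "copyCurrentNode(1)" and queried with the usual XML::LibXML node
       methods, while the rest of the document is still streamed.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical list of records.
my $xml = '<list><item n="1"><v>a</v></item><item n="2"><v>b</v></item></list>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @values;
while ( $reader->nextElement('item') == 1 ) {
    # Deep copy of the current element as an ordinary DOM node.
    my $node = $reader->copyCurrentNode(1);
    push @values, $node->getAttribute('n') . '=' . $node->findvalue('v');
}
print join( ' ', @values ), "\n";
```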

METHODS PROCESSING ATTRIBUTES

       attributeCount ()
           Provides the number of attributes of the current node.

       hasAttributes ()
           Whether the node has attributes.

       getAttribute (name)
           Provides the value of the attribute with the specified qualified
           name.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       getAttributeNs (localName, namespaceURI)
           Provides the value of the attribute with the specified local name
           and namespace URI.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       getAttributeNo (no)
           Provides the value of the attribute with the specified index
           relative to the containing element.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       isDefault ()
           Returns true if the current attribute node was generated from the
           default value defined in the DTD.

       moveToAttribute (name)
           Moves the position to the attribute with the specified qualified
           name.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToAttributeNo (no)
           Moves the position to the attribute with the specified index
           relative to the containing element.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToAttributeNs (localName,namespaceURI)
           Moves the position to the attribute with the specified local name
           and namespace URI.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToFirstAttribute ()
           Moves the position to the first attribute associated with the
           current node.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToNextAttribute ()
           Moves the position to the next attribute associated with the
           current node.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToElement ()
           Moves the position to the node that contains the current attribute
           node.

           Returns 1 in case of success, -1 in case of error, 0 if not moved.

       isNamespaceDecl ()
           Determines whether the current node is a namespace declaration
           rather than a regular attribute.

           Returns 1 if the current node is a namespace declaration, 0 if it
           is a regular attribute or other type of node, or -1 in case of
           error.

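       The attribute cursor methods combine into a simple loop. The sketch
       below, on a made-up element, walks all attributes with
       "moveToFirstAttribute()" and "moveToNextAttribute()" and then returns
       to the owning element with "moveToElement()".

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical element with two attributes.
my $xml = '<e a="1" b="2"/>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";
$reader->read;    # position the reader on <e>

my %attr;
if ( $reader->moveToFirstAttribute == 1 ) {
    do {
        # While on an attribute node, name/value describe the attribute.
        $attr{ $reader->name } = $reader->value;
    } while ( $reader->moveToNextAttribute == 1 );
    $reader->moveToElement;    # back to the element itself
}
printf "%s has %d attributes\n", $reader->name, $reader->attributeCount;
```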

OTHER METHODS

       lookupNamespace (prefix)
           Resolves a namespace prefix in the scope of the current element.

           Returns a string containing the namespace URI to which the prefix
           maps, or undef in case of error.

       encoding ()
           Returns a string containing the encoding of the document, or undef
           in case of error.

       standalone ()
           Determines the standalone status of the document being read.
           Returns 1 if the document was declared to be standalone, 0 if it
           was declared to be not standalone, or -1 if the document did not
           specify its standalone status or in case of error.

       xmlVersion ()
           Determines the XML version of the document being read. Returns a
           string containing the XML version of the document, or undef in case
           of error.

       baseURI ()
           Returns the base URI of the current node.

       isValid ()
           Retrieves the validity status from the parser.

           Returns 1 if valid, 0 if not, and -1 in case of error.

       xmlLang ()
           The xml:lang scope within which the node resides.

       lineNumber ()
           Provides the line number of the current parsing point.

       columnNumber ()
           Provides the column number of the current parsing point.

       byteConsumed ()
           This function provides the current index of the parser relative to
           the start of the current entity. The index is computed in bytes,
           starting at zero and finishing at the size in bytes of the file if
           parsing a file. The function is of constant cost if the input is
           UTF-8 but can be costly if run on non-UTF-8 input.

       setParserProp (prop => value, ...)
           Changes the parser processing behaviour by changing some of its
           internal properties. The following properties are available with
           this function: "load_ext_dtd", "complete_attributes",
           "validation", "expand_entities".

           Since some of the properties can only be changed before any read
           has been done, it is best to set the parsing properties in the
           constructor.

           Returns 0 if the call was successful, or -1 in case of error.

       getParserProp (prop)
           Gets the value of a parser internal property. The following
           property names can be used: "load_ext_dtd", "complete_attributes",
           "validation", "expand_entities".

           Returns the value, usually 0 or 1, or -1 in case of error.

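       As a small illustration of two of these methods, the sketch below
       resolves a prefix at the current position and reports the XML version
       of the document. The prefix "p" and the URI "urn:example:p" are
       invented for the example; both the local name and the namespace URI
       are passed to "nextElement()" so the namespaced element is matched
       unambiguously.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical document; the prefix "p" and its URI are made up.
my $xml = '<?xml version="1.0"?>'
        . '<p:doc xmlns:p="urn:example:p"><p:child/></p:doc>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

# Position the reader on the <p:child/> element.
$reader->nextElement( 'child', 'urn:example:p' ) == 1
    or die "no child element\n";

my $uri     = $reader->lookupNamespace('p');
my $version = $reader->xmlVersion;
print "p => $uri, XML version $version\n";
```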

DESTRUCTION

       XML::LibXML takes care of the reader object destruction when the last
       reference to the reader object goes out of scope. The document tree is
       preserved, though, if either of $reader->document or
       $reader->preserveNode was used and references to the document tree
       exist.

NODE TYPES

       The reader interface provides the following constants for node types
       (the constant symbols are exported by default or if the tag ":types"
       is used).

         XML_READER_TYPE_NONE                    => 0
         XML_READER_TYPE_ELEMENT                 => 1
         XML_READER_TYPE_ATTRIBUTE               => 2
         XML_READER_TYPE_TEXT                    => 3
         XML_READER_TYPE_CDATA                   => 4
         XML_READER_TYPE_ENTITY_REFERENCE        => 5
         XML_READER_TYPE_ENTITY                  => 6
         XML_READER_TYPE_PROCESSING_INSTRUCTION  => 7
         XML_READER_TYPE_COMMENT                 => 8
         XML_READER_TYPE_DOCUMENT                => 9
         XML_READER_TYPE_DOCUMENT_TYPE           => 10
         XML_READER_TYPE_DOCUMENT_FRAGMENT       => 11
         XML_READER_TYPE_NOTATION                => 12
         XML_READER_TYPE_WHITESPACE              => 13
         XML_READER_TYPE_SIGNIFICANT_WHITESPACE  => 14
         XML_READER_TYPE_END_ELEMENT             => 15
         XML_READER_TYPE_END_ENTITY              => 16
         XML_READER_TYPE_XML_DECLARATION         => 17

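       The constants are typically compared against "nodeType()". The sketch
       below maps a few of them to labels (the labels and the sample document
       are assumptions for illustration) and traces the node sequence the
       reader reports. Note that the empty element <e/> produces no
       end-element node.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Labels chosen for this example only.
my %label = (
    XML_READER_TYPE_ELEMENT()     => 'element',
    XML_READER_TYPE_TEXT()        => 'text',
    XML_READER_TYPE_COMMENT()     => 'comment',
    XML_READER_TYPE_END_ELEMENT() => 'end-element',
);

# Hypothetical document.
my $xml = '<r>t<!--c--><e/></r>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @trace;
while ( $reader->read ) {
    my $type = $reader->nodeType;
    push @trace, exists $label{$type} ? $label{$type} : "type $type";
}
print join( ',', @trace ), "\n";
```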

STATES

       The following constants represent the values returned by "readState()".
       They are exported by default, or if the tag ":states" is used:

         XML_READER_NONE      => -1
         XML_READER_START     =>  0
         XML_READER_ELEMENT   =>  1
         XML_READER_END       =>  2
         XML_READER_EMPTY     =>  3
         XML_READER_BACKTRACK =>  4
         XML_READER_DONE      =>  5
         XML_READER_ERROR     =>  6

ORIGINAL IMPLEMENTATION

       Heiko Klein, <H.Klein@gmx.net> and Petr Pajas

AUTHORS

       Matt Sergeant, Christian Glahn, Petr Pajas

VERSION

       2.0018

COPYRIGHT

       2001-2007, AxKit.com Ltd.

       2002-2006, Christian Glahn.

       2006-2009, Petr Pajas.
perl v5.16.3                      2013-05-13            XML::LibXML::Reader(3)