XML::LibXML::Reader(3)  User Contributed Perl Documentation  XML::LibXML::Reader(3)

NAME

       XML::LibXML::Reader - interface to libxml2 pull parser

SYNOPSIS

         use XML::LibXML::Reader;

         my $reader = XML::LibXML::Reader->new(location => "file.xml")
                or die "cannot read file.xml\n";
         while ($reader->read) {
           processNode($reader);
         }

         sub processNode {
             my $reader = shift;
             printf "%d %d %s %d\n", ($reader->depth,
                                      $reader->nodeType,
                                      $reader->name,
                                      $reader->isEmptyElement);
         }

       or

         my $reader = XML::LibXML::Reader->new(location => "file.xml")
                or die "cannot read file.xml\n";
         $reader->preservePattern('//table/tr');
         $reader->finish;
         print $reader->document->toString(1);

DESCRIPTION

       This is a Perl interface to libxml2's pull-parser implementation
       xmlTextReader <http://xmlsoft.org/html/libxml-xmlreader.html>. This
       feature requires at least libxml2-2.6.21. Pull parsers (such as StAX
       in Java, or XmlReader in C#) use an iterator approach to parse XML
       documents. They are easier to program than event-based parsers (SAX)
       and much more lightweight than tree-based parsers (DOM), which load
       the complete tree into memory.

       The Reader acts as a cursor going forward on the document stream and
       stopping at each node on the way. At every point, the DOM-like methods
       of the Reader object allow one to examine the current node (name,
       namespace, attributes, etc.).

       The user's code keeps control of the progress and simply calls the
       "read()" function repeatedly to progress to the next node in
       document order. Other functions provide means for skipping complete
       sub-trees, or skipping nodes until a specific element is reached.

       At any time, only a very limited portion of the document is kept in
       memory, which makes the API more memory-efficient than DOM. However,
       it is also possible to mix Reader with DOM. At every point the user
       may copy the current node (optionally expanded into a complete
       sub-tree) from the processed document to another DOM tree, or
       instruct the Reader to collect a sub-document in the form of a DOM
       tree consisting of selected nodes.

       The Reader API also supports namespaces, xml:base, entity handling,
       and DTD validation. RelaxNG and W3C XML Schema validation are
       available through the "RelaxNG" and "Schema" constructor options
       (see "Reader options").

       The naming of methods compared to libxml2 and C# XmlTextReader has been
       changed slightly to match the conventions of XML::LibXML. Some
       functions have been changed or added with respect to the C interface.

CONSTRUCTOR

       Depending on the XML source, the Reader object can be created with
       one of:

         my $reader = XML::LibXML::Reader->new( location => "file.xml", ... );
         my $reader = XML::LibXML::Reader->new( string => $xml_string, ... );
         my $reader = XML::LibXML::Reader->new( IO => $file_handle, ... );
         my $reader = XML::LibXML::Reader->new( FD => fileno(STDIN), ... );
         my $reader = XML::LibXML::Reader->new( DOM => $dom, ... );

       where ... are (optional) reader options described below in "Reader
       options" or various parser options described in XML::LibXML::Parser.
       The constructor recognizes the following XML sources:

   Source specification
       location
           Read XML from a local file or (non-HTTPS) URL.

       string
           Read XML from a string.

       IO  Read XML from a Perl IO filehandle.

       FD  Read XML from a file descriptor (bypasses the Perl I/O layer; only
           applicable to filehandles for regular files or pipes). Possibly
           faster than IO.

       DOM Use the reader API to walk through a pre-parsed
           XML::LibXML::Document.

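As a rough illustration of the source specifications above, the following sketch feeds the same markup to the Reader once as a string and once as a pre-parsed DOM. The sample document and the helper "element_names" are made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical sample document used for both sources.
my $xml = '<doc><item id="1">one</item><item id="2">two</item></doc>';

# Source 1: parse directly from a string.
my $from_string = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader from string\n";

# Source 2: walk a document that was already parsed into a DOM tree.
my $dom      = XML::LibXML->load_xml(string => $xml);
my $from_dom = XML::LibXML::Reader->new(DOM => $dom)
    or die "cannot create reader from DOM\n";

# Illustrative helper: collect the names of all start tags seen by a reader.
sub element_names {
    my ($reader) = @_;
    my @names;
    while ($reader->read) {
        push @names, $reader->name
            if $reader->nodeType == XML_READER_TYPE_ELEMENT;
    }
    return @names;
}

my @string_names = element_names($from_string);
my @dom_names    = element_names($from_dom);

print "string: @string_names\n";
print "DOM:    @dom_names\n";
```

Both readers should report the same sequence of element names, since only the input source differs.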
   Reader options
       encoding => $encoding
           Overrides the document encoding.

       RelaxNG => $rng_schema
           Can be used to pass either an XML::LibXML::RelaxNG object or a
           filename or (non-HTTPS) URL of a RelaxNG schema to the constructor.
           The schema is then used to validate the document as it is
           processed.

       Schema => $xsd_schema
           Can be used to pass either an XML::LibXML::Schema object or a
           filename or (non-HTTPS) URL of a W3C XSD schema to the constructor.
           The schema is then used to validate the document as it is
           processed.

       ... The reader further supports various parser options described in
           XML::LibXML::Parser (specifically those labeled by /reader/).

METHODS CONTROLLING PARSING PROGRESS

       read ()
           Moves the position to the next node in the stream, exposing its
           properties.

           Returns 1 if the node was read successfully, 0 if there are no
           more nodes to read, or -1 in case of error.

       readAttributeValue ()
           Parses an attribute value into one or more Text and EntityReference
           nodes.

           Returns 1 in case of success, 0 if the reader was not positioned on
           an attribute node or all the attribute values have been read, or -1
           in case of error.

       readState ()
           Gets the read state of the reader. Returns the state value, or -1
           in case of error. The module exports constants for the Reader
           states, see STATES below.

       depth ()
           The depth of the node in the tree, starting at 0 for the root node.

       next ()
           Skips to the node following the current one in document order,
           bypassing the current node's sub-tree, if any. Returns 1 if the
           node was read successfully, 0 if there are no more nodes to read,
           or -1 in case of error.

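To make the difference between "read()" and "next()" concrete, here is a minimal sketch. The sample document and its element names are invented for illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample: <skip> has a sub-tree, <keep/> is its following sibling.
my $xml = '<root><skip><deep/></skip><keep/></root>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

$reader->read;    # positioned on <root>
$reader->read;    # positioned on <skip>
$reader->next;    # bypasses <deep/> and the rest of the <skip> sub-tree

my $after_next = $reader->name;
print "after next(): $after_next\n";
```

Calling "read()" a third time instead of "next()" would have stopped on the <deep/> element inside the sub-tree.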
       nextElement (localname?,nsURI?)
           Skips nodes following the current one in document order until a
           specific element is reached. The element's name must equal the
           given localname, if defined, and its namespace must equal the
           given nsURI, if defined. Either argument can be undefined (or
           omitted, in the case of the latter or both).

           Returns 1 if the element was found, 0 if there are no more nodes
           to read, or -1 in case of error.

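A small sketch of "nextElement" with both arguments given; the namespace URI and element names are assumptions made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical document mixing namespaced and plain elements.
my $xml = '<lib xmlns:b="urn:example:books">'
        . '<b:book/><cd/><b:book/>'
        . '</lib>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

# Count every element with local name "book" in the given namespace.
my $book_count = 0;
while ($reader->nextElement('book', 'urn:example:books') == 1) {
    $book_count++;
}

print "books found: $book_count\n";
```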
       nextPatternMatch (compiled_pattern)
           Skips nodes following the current one in document order until an
           element matching a given compiled pattern is reached. See
           XML::LibXML::Pattern for information on compiled patterns. See also
           the "matchesPattern" method.

           Returns 1 if the element was found, 0 if there are no more nodes
           to read, or -1 in case of error.

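The following sketch shows one way to drive the reader with a compiled pattern. The catalog document is hypothetical, and the node-type check is a defensive assumption in case the reader also stops on the end tag of a matching element.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;
use XML::LibXML::Pattern;

# Hypothetical sample catalog.
my $xml = '<catalog>'
        . '<item><title>A</title><price>1</price></item>'
        . '<item><title>B</title><price>2</price></item>'
        . '</catalog>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";
my $pattern = XML::LibXML::Pattern->new('/catalog/item/title');

my @titles;
while ($reader->nextPatternMatch($pattern) == 1) {
    # Only act on start tags; skip any stop on a matching end tag.
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    # Expand the current element into a DOM copy and take its text.
    push @titles, $reader->copyCurrentNode(1)->textContent;
}

print join(', ', @titles), "\n";
```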
       skipSiblings ()
           Skips all nodes on the same or lower level until the first node on
           a higher level is reached. In particular, if the current node
           occurs in an element, the reader stops at the end tag of the
           parent element; otherwise it stops at a node immediately following
           the parent node.

           Returns 1 if successful, 0 if the end of the document is reached,
           or -1 in case of error.

       nextSibling ()
           Skips to the node following the current one in document order,
           bypassing the current node's sub-tree, if any.

           Returns 1 if the node was read successfully, 0 if there are no
           more nodes to read, or -1 in case of error.

       nextSiblingElement (name?,nsURI?)
           Like "nextElement ()", but only processes sibling elements of the
           current node (moving forward internally using "nextSibling ()"
           rather than "read ()").

           Returns 1 if the element was found, 0 if there are no more sibling
           nodes, or -1 in case of error.

       finish ()
           Skips all remaining nodes in the document, reaching the end of the
           document.

           Returns 1 if successful, 0 in case of error.

       close ()
           Releases any resources allocated by the current instance and
           closes any underlying input. Returns 0 on failure and 1 on
           success. The destructor calls this method automatically when the
           reader is no longer referenced, so you do not have to call it
           directly.

METHODS EXTRACTING INFORMATION

       name ()
           Returns the qualified name of the current node, equal to
           (Prefix:)LocalName.

       nodeType ()
           Returns the type of the current node. See NODE TYPES below.

       localName ()
           Returns the local name of the node.

       prefix ()
           Returns the prefix of the namespace associated with the node.

       namespaceURI ()
           Returns the URI defining the namespace associated with the node.

       isEmptyElement ()
           Checks whether the current node is an empty element. Note that
           this reflects the markup rather than the content: <a/> is
           considered empty, while <a></a> is not.

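The distinction made by "isEmptyElement" can be checked with a short sketch; the three-element document is made up for the purpose.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# <a/> uses the empty-element syntax, <b></b> uses a start and an end tag.
my $reader = XML::LibXML::Reader->new(string => '<r><a/><b></b></r>')
    or die "cannot create reader\n";

my %is_empty;
while ($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    $is_empty{ $reader->name } = $reader->isEmptyElement ? 1 : 0;
}

printf "%s: %s\n", $_, $is_empty{$_} ? 'empty' : 'not empty'
    for sort keys %is_empty;
```

One practical consequence is that an empty element produces no separate end-element node, so code that pairs start and end tags may want to consult this flag.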
       hasValue ()
           Returns true if the node can have a text value.

       value ()
           Provides the text value of the node if present, or undef if not
           available.

       readInnerXml ()
           Reads the contents of the current node, including child nodes and
           markup. Returns a string containing the XML of the node's content,
           or undef if the current node is neither an element nor an
           attribute, or has no child nodes.

       readOuterXml ()
           Reads the current node together with its contents, including child
           nodes and markup.

           Returns a string containing the XML of the node including its
           content, or undef if the current node is neither an element nor
           an attribute.

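A minimal sketch contrasting "readInnerXml" and "readOuterXml" on the same element; the sample markup is an assumption for illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample with mixed content inside <p>.
my $reader = XML::LibXML::Reader->new(
    string => '<doc><p>Hello <b>world</b></p></doc>'
) or die "cannot create reader\n";

# Position the reader on the <p> start tag.
$reader->nextElement('p') == 1 or die "no <p> element found\n";

my $inner = $reader->readInnerXml;    # content of <p> only
my $outer = $reader->readOuterXml;    # <p> itself plus its content

print "inner: $inner\n";
print "outer: $outer\n";
```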
       nodePath()
           Returns a canonical location path from the root node to the
           current node. Namespaced elements are matched by '*', because
           there is no way to declare prefixes within XPath patterns. Unlike
           "XML::LibXML::Node::nodePath()", this function does not provide
           sibling counts (i.e. instead of e.g. '/a/b[1]' and '/a/b[2]' you
           get '/a/b' for both matches).

       matchesPattern(compiled_pattern)
           Returns a true value if the current node matches a compiled
           pattern. See XML::LibXML::Pattern for information on compiled
           patterns. See also the "nextPatternMatch" method.

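The two methods above can be combined, as in this sketch that reports the path of every element and whether it matches a compiled pattern. The document and the pattern are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;
use XML::LibXML::Pattern;

# Hypothetical sample: <b> occurs both directly under <a> and under <c>.
my $reader = XML::LibXML::Reader->new(string => '<a><b/><b/><c><b/></c></a>')
    or die "cannot create reader\n";
my $pattern = XML::LibXML::Pattern->new('/a/b');

my @report;
while ($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    my $verdict = $reader->matchesPattern($pattern) ? 'match' : 'no match';
    push @report, $reader->nodePath . ": $verdict";
}

print "$_\n" for @report;
```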

METHODS EXTRACTING DOM NODES

       document ()
           Provides access to the document tree built by the reader. This
           function can be used to collect the preserved nodes (see
           "preserveNode()" and "preservePattern()").

           CAUTION: Never use this function to modify the tree unless reading
           of the whole document is completed!

       copyCurrentNode (deep)
           This function is similar to the DOM function "copyNode()". It
           returns a copy of the currently processed node as a corresponding
           DOM object. Use deep = 1 to obtain the full sub-tree.

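As a sketch of mixing the Reader with DOM, the loop below copies selected elements into a separate output document. The log format, the "level" attribute, and the "errors" root element are assumptions for illustration only.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical input: log entries with a "level" attribute.
my $xml = '<log>'
        . '<entry level="error">disk full</entry>'
        . '<entry level="info">started</entry>'
        . '<entry level="error">timeout</entry>'
        . '</log>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

# Separate DOM document that collects the copies.
my $out  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $out->createElement('errors');
$out->setDocumentElement($root);

while ($reader->nextElement('entry') == 1) {
    next unless ($reader->getAttribute('level') // '') eq 'error';
    my $copy = $reader->copyCurrentNode(1);        # deep copy of the element
    $root->appendChild($out->importNode($copy));   # import into the output doc
}

print $out->toString(1);
```

Only the copied sub-trees are held as DOM nodes; the rest of the input is still processed as a stream.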
       preserveNode ()
           Tells the XML Reader to preserve the current node in the document
           tree. A document tree consisting of the preserved nodes and their
           content can be obtained using the method "document()" once parsing
           is finished.

           Returns the node or NULL in case of error.

       preservePattern (pattern,\%ns_map)
           Tells the XML Reader to preserve all nodes matched by the pattern
           (which is a streaming XPath subset). A document tree consisting of
           the preserved nodes and their content can be obtained using the
           method "document()" once parsing is finished.

           An optional second argument can be used to provide a HASH reference
           mapping prefixes used by the XPath to namespace URIs.

           The XPath subset available with this function is described at

             http://www.w3.org/TR/xmlschema-1/#Selector

           and matches the production

             Path ::= ('.//')? ( Step '/' )* ( Step | '@' NameTest )

           Returns a positive number in case of success and -1 in case of
           error.

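A hedged sketch of "preservePattern" with the optional namespace map; the namespace URI, the prefix "t", and the element names are assumptions made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical namespaced input.
my $xml = '<t:table xmlns:t="urn:example:table">'
        . '<t:row>Ada</t:row>'
        . '<t:note>not selected</t:note>'
        . '</t:table>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

# Register the pattern before reading; map the prefix used in the pattern
# to the namespace URI used in the document.
$reader->preservePattern('//t:row', { t => 'urn:example:table' });

# Run through the whole document, then fetch the collected tree.
$reader->finish;
my $kept = $reader->document->toString(1);

print $kept;
```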

METHODS PROCESSING ATTRIBUTES

       attributeCount ()
           Provides the number of attributes of the current node.

       hasAttributes ()
           Returns true if the node has attributes.

       getAttribute (name)
           Provides the value of the attribute with the specified qualified
           name.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       getAttributeNs (localName, namespaceURI)
           Provides the value of the specified attribute.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       getAttributeNo (no)
           Provides the value of the attribute with the specified index
           relative to the containing element.

           Returns a string containing the value of the specified attribute,
           or undef in case of error.

       isDefault ()
           Returns true if the current attribute node was generated from the
           default value defined in the DTD.

       moveToAttribute (name)
           Moves the position to the attribute with the specified qualified
           name.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToAttributeNo (no)
           Moves the position to the attribute with the specified index
           relative to the containing element.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToAttributeNs (localName,namespaceURI)
           Moves the position to the attribute with the specified local name
           and namespace URI.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToFirstAttribute ()
           Moves the position to the first attribute associated with the
           current node.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToNextAttribute ()
           Moves the position to the next attribute associated with the
           current node.

           Returns 1 in case of success, -1 in case of error, 0 if not found.

       moveToElement ()
           Moves the position to the node that contains the current attribute
           node.

           Returns 1 in case of success, -1 in case of error, 0 if not moved.

       isNamespaceDecl ()
           Determines whether the current node is a namespace declaration
           rather than a regular attribute.

           Returns 1 if the current node is a namespace declaration, 0 if it
           is a regular attribute or other type of node, or -1 in case of
           error.

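The attribute methods above are typically used together. This sketch walks all attributes of one element and then returns to the element itself; the <img> element and its attributes are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $reader = XML::LibXML::Reader->new(
    string => '<img src="a.png" alt="A" width="10"/>'
) or die "cannot create reader\n";

$reader->read;    # positioned on the <img> element

my $count = $reader->attributeCount;
my $src   = $reader->getAttribute('src');    # direct lookup by name

# Iterate over the attributes with the cursor-style methods.
my %attr;
if ($reader->moveToFirstAttribute == 1) {
    do {
        $attr{ $reader->name } = $reader->value;
    } while ($reader->moveToNextAttribute == 1);
    $reader->moveToElement;    # go back to the owning element
}

my $back_on = $reader->name;
print "$count attributes on <$back_on>\n";
print "$_ = $attr{$_}\n" for sort keys %attr;
```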

OTHER METHODS

       lookupNamespace (prefix)
           Resolves a namespace prefix in the scope of the current element.

           Returns a string containing the namespace URI to which the prefix
           maps or undef in case of error.

       encoding ()
           Returns a string containing the encoding of the document or undef
           in case of error.

       standalone ()
           Determines the standalone status of the document being read.
           Returns 1 if the document was declared to be standalone, 0 if it
           was declared to be not standalone, or -1 if the document did not
           specify its standalone status or in case of error.

       xmlVersion ()
           Determines the XML version of the document being read. Returns a
           string containing the XML version of the document or undef in case
           of error.

       baseURI ()
           Returns the base URI of the current node.

       isValid ()
           Retrieves the validity status from the parser.

           Returns 1 if valid, 0 if not, and -1 in case of error.

       xmlLang ()
           Returns the xml:lang scope within which the node resides.

       lineNumber ()
           Provides the line number of the current parsing point.

       columnNumber ()
           Provides the column number of the current parsing point.

       byteConsumed ()
           Provides the current index of the parser relative to the start of
           the current entity. The index is computed in bytes, starting at
           zero and finishing at the size in bytes of the file if parsing a
           file. The function is of constant cost if the input is UTF-8 but
           can be costly if run on non-UTF-8 input.

       setParserProp (prop => value, ...)
           Changes the parser processing behaviour by changing some of its
           internal properties. The following properties are available with
           this function: ``load_ext_dtd'', ``complete_attributes'',
           ``validation'', ``expand_entities''.

           Since some of the properties can only be changed before any read
           has been done, it is best to set the parsing properties in the
           constructor.

           Returns 0 if the call was successful, or -1 in case of error.

       getParserProp (prop)
           Gets the value of a parser internal property. The following
           property names can be used: ``load_ext_dtd'',
           ``complete_attributes'', ``validation'', ``expand_entities''.

           Returns the value, usually 0 or 1, or -1 in case of error.

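A short sketch of a few of the document-level queries above. The XML declaration, the prefix "p", and the namespace URI are assumptions chosen for the example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical document with an explicit XML declaration and one namespace.
my $xml = '<?xml version="1.0" standalone="yes"?>'
        . '<r xmlns:p="urn:example:p"><p:x/></r>';

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

$reader->read;    # positioned on <r>, which declares the prefix "p"

my $ns_uri     = $reader->lookupNamespace('p');
my $version    = $reader->xmlVersion;
my $standalone = $reader->standalone;

print "prefix p maps to: $ns_uri\n";
print "XML version:      $version\n";
print "standalone:       $standalone\n";
```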

DESTRUCTION

       XML::LibXML takes care of the reader object destruction when the last
       reference to the reader object goes out of scope. The document tree is
       preserved, though, if either of $reader->document or
       $reader->preserveNode was used and references to the document tree
       exist.

NODE TYPES

       The reader interface provides the following constants for node types
       (the constant symbols are exported by default or if tag ":types" is
       used).

         XML_READER_TYPE_NONE                    => 0
         XML_READER_TYPE_ELEMENT                 => 1
         XML_READER_TYPE_ATTRIBUTE               => 2
         XML_READER_TYPE_TEXT                    => 3
         XML_READER_TYPE_CDATA                   => 4
         XML_READER_TYPE_ENTITY_REFERENCE        => 5
         XML_READER_TYPE_ENTITY                  => 6
         XML_READER_TYPE_PROCESSING_INSTRUCTION  => 7
         XML_READER_TYPE_COMMENT                 => 8
         XML_READER_TYPE_DOCUMENT                => 9
         XML_READER_TYPE_DOCUMENT_TYPE           => 10
         XML_READER_TYPE_DOCUMENT_FRAGMENT       => 11
         XML_READER_TYPE_NOTATION                => 12
         XML_READER_TYPE_WHITESPACE              => 13
         XML_READER_TYPE_SIGNIFICANT_WHITESPACE  => 14
         XML_READER_TYPE_END_ELEMENT             => 15
         XML_READER_TYPE_END_ENTITY              => 16
         XML_READER_TYPE_XML_DECLARATION         => 17

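As an illustration of using the exported constants instead of bare numbers, this sketch tallies the node types seen in a small made-up document. The label table is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Illustrative labels for a few node types; the trailing () forces the
# constants to be evaluated rather than treated as bareword hash keys.
my %label = (
    XML_READER_TYPE_ELEMENT()     => 'element',
    XML_READER_TYPE_TEXT()        => 'text',
    XML_READER_TYPE_COMMENT()     => 'comment',
    XML_READER_TYPE_END_ELEMENT() => 'end element',
);

my $reader = XML::LibXML::Reader->new(string => '<r>text<!--c--><e/></r>')
    or die "cannot create reader\n";

my %seen;
while ($reader->read) {
    my $type = $reader->nodeType;
    $seen{ $label{$type} // "type $type" }++;
}

printf "%-12s %d\n", $_, $seen{$_} for sort keys %seen;
```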

STATES

       The following constants represent the values returned by "readState()".
       They are exported by default, or if tag ":states" is used:

         XML_READER_NONE      => -1
         XML_READER_START     =>  0
         XML_READER_ELEMENT   =>  1
         XML_READER_END       =>  2
         XML_READER_EMPTY     =>  3
         XML_READER_BACKTRACK =>  4
         XML_READER_DONE      =>  5
         XML_READER_ERROR     =>  6

SEE ALSO

       XML::LibXML::Pattern for information about compiled patterns.

       http://xmlsoft.org/html/libxml-xmlreader.html

       http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html

ORIGINAL IMPLEMENTATION

       Heiko Klein, <H.Klein@gmx.net> and Petr Pajas

AUTHORS

       Matt Sergeant, Christian Glahn, Petr Pajas

VERSION

       2.0207

COPYRIGHT

       2001-2007, AxKit.com Ltd.

       2002-2006, Christian Glahn.

       2006-2009, Petr Pajas.

LICENSE

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1                      2021-04-19            XML::LibXML::Reader(3)