Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME
       XML::Parser - A perl module for parsing XML documents

SYNOPSIS
         use XML::Parser;

         $p1 = XML::Parser->new(Style => 'Debug');
         $p1->parsefile('REC-xml-19980210.xml');
         $p1->parse('<foo id="me">Hello World</foo>');

         # Alternative
         $p2 = XML::Parser->new(Handlers => {Start => \&handle_start,
                                             End   => \&handle_end,
                                             Char  => \&handle_char});
         $p2->parse($socket);

         # Another alternative
         $p3 = XML::Parser->new(ErrorContext => 2);

         $p3->setHandlers(Char    => \&text,
                          Default => \&other);

         open(FOO, 'xmlgenerator |');
         $p3->parse(*FOO, ProtocolEncoding => 'ISO-8859-1');
         close(FOO);

         $p3->parsefile('junk.xml', ErrorContext => 3);

DESCRIPTION
       This module provides ways to parse XML documents. It is built on top of
       XML::Parser::Expat, which is a lower level interface to James Clark's
       expat library. Each call to one of the parsing methods creates a new
       instance of XML::Parser::Expat, which is then used to parse the
       document. Expat options may be provided when the XML::Parser object is
       created. These options are then passed on to the Expat object on each
       parse call. They can also be given as extra arguments to the parse
       methods, in which case they override options given at XML::Parser
       creation time.

       The behavior of the parser is controlled either by the "Style" and/or
       "Handlers" options, or by the "setHandlers" method. These all provide
       mechanisms for XML::Parser to set the handlers needed by
       XML::Parser::Expat. If neither "Style" nor "Handlers" is specified,
       then parsing just checks that the document is well-formed.

       When underlying handlers get called, they receive as their first
       parameter the Expat object, not the Parser object.
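
       As a minimal sketch of the well-formedness check described above, the
       following wraps a handler-less parse in eval. The helper name
       is_well_formed is hypothetical and not part of the module.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: with no Style and no Handlers, parse() only
# checks well-formedness. It dies on the first error, so trap it.
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

print is_well_formed('<foo id="me">Hello World</foo>') ? "ok\n" : "not ok\n";
print is_well_formed('<foo><bar></foo>')               ? "ok\n" : "not ok\n";
```

       A fresh XML::Parser::Expat instance is created on each call, so the
       same XML::Parser object could also be reused across documents.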

METHODS
       new This is a class method, the constructor for XML::Parser. Options
           are passed as keyword value pairs. Recognized options are:

           * Style
             This option provides an easy way to create a given style of
             parser. The built in styles are: "Debug", "Subs", "Tree",
             "Objects", and "Stream". These are all defined in separate
             packages under "XML::Parser::Style::*", and you can find
             further documentation for each style both below and in those
             packages.

             Custom styles can be provided by giving a full package name
             containing at least one '::'. This package should then have
             subs defined for each handler it wishes to have installed. See
             "STYLES" below for a discussion of each built in style.

           * Handlers
             When provided, this option should be an anonymous hash
             containing as keys the type of handler and as values a sub
             reference to handle that type of event. All the handlers get
             passed as their first parameter the instance of expat that is
             parsing the document. Further details on handlers can be found
             in "HANDLERS". Any handler set here overrides the corresponding
             handler set with the Style option.

           * Pkg
             Some styles will refer to subs defined in this package. If not
             provided, it defaults to the package which called the
             constructor.

           * ErrorContext
             This is an Expat option. When this option is defined, errors
             are reported in context. The value should be the number of
             lines to show on either side of the line in which the error
             occurred.

           * ProtocolEncoding
             This is an Expat option. This sets the protocol encoding name.
             It defaults to none. The built-in encodings are: "UTF-8",
             "ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be
             used if they have encoding maps in one of the directories in
             the @Encoding_Path list. Check "ENCODINGS" for more information
             on encoding maps. Setting the protocol encoding overrides any
             encoding in the XML declaration.

           * Namespaces
             This is an Expat option. If this is set to a true value, then
             namespace processing is done during the parse. See "Namespaces"
             in XML::Parser::Expat for further discussion of namespace
             processing.

           * NoExpand
             This is an Expat option. Normally, the parser will try to
             expand references to entities defined in the internal subset.
             If this option is set to a true value, and a default handler is
             also set, then the default handler will be called when an
             entity reference is seen in text. This has no effect if a
             default handler has not been registered, and it has no effect
             on the expansion of entity references inside attribute values.

           * Stream_Delimiter
             This is an Expat option. It takes a string value. When this
             string is found alone on a line while parsing from a stream,
             then the parse is ended as if it saw an end of file. The
             intended use is with a stream of XML documents in a MIME
             multipart format. The string should not contain a trailing
             newline.

           * ParseParamEnt
             This is an Expat option. Unless standalone is set to "yes" in
             the XML declaration, setting this to a true value allows the
             external DTD to be read, and parameter entities to be parsed
             and expanded.

           * NoLWP
             This option has no effect if the ExternEnt or ExternEntFin
             handlers are directly set. Otherwise, if true, it forces the
             use of a file based external entity handler.

           * Non-Expat-Options
             If provided, this should be an anonymous hash whose keys are
             options that shouldn't be passed to Expat. This should only be
             of concern to those subclassing XML::Parser.

       setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
           This method registers handlers for various parser events. It
           overrides any previous handlers registered through the Style or
           Handlers options or through earlier calls to setHandlers. By
           providing a false or undefined value as the handler, the existing
           handler can be unset.

           This method returns a list of type, handler pairs corresponding to
           the input. The handlers returned are the ones that were in effect
           prior to the call.

           See a description of the handler types in "HANDLERS".
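
           The return value makes it possible to swap a handler in
           temporarily and put the old one back afterwards. The following is
           a sketch of that pattern; the variable names are illustrative
           only.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;
my $parser = XML::Parser->new(
    Handlers => { Start => sub { push @seen, "first:$_[1]" } },
);

# setHandlers returns the (type, handler) pairs that were in effect
# before the call, so they can be saved for later.
my @previous = $parser->setHandlers(
    Start => sub { push @seen, "second:$_[1]" },
);
$parser->parse('<a/>');

# Passing the saved pairs back restores the earlier handler.
$parser->setHandlers(@previous);
$parser->parse('<b/>');

print join(', ', @seen), "\n";
```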

       parse(SOURCE [, OPT => OPT_VALUE [...]])
           The SOURCE parameter should either be a string containing the whole
           XML document, or it should be an open IO::Handle. Constructor
           options to XML::Parser::Expat given as keyword-value pairs may
           follow the SOURCE parameter. These override, for this call, any
           options or attributes passed through from the XML::Parser instance.

           This method dies if a parse error occurs. Otherwise it returns 1,
           or whatever the Final handler returns if one is installed. In other
           words, what parse may return depends on the style.
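
           Because parse dies on error, callers usually wrap it in eval. The
           sketch below assumes a deliberately broken three-line document and
           uses the ErrorContext option so that the message includes the
           surrounding lines.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(ErrorContext => 1);

# parse() dies on a parse error, so trap it with eval and inspect $@.
my $ok    = eval { $parser->parse("<a>\n<b>\n</a>\n"); 1 };
my $error = $ok ? '' : $@;

print $ok ? "parsed\n" : "parse failed: $error";
```

           The same option could instead be given for a single call, for
           example $parser->parse($xml, ErrorContext => 3), in which case it
           overrides the value given to the constructor.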

       parsestring
           This is just an alias for parse for backwards compatibility.

       parsefile(FILE [, OPT => OPT_VALUE [...]])
           Open FILE for reading, then call parse with the open handle. The
           file is closed no matter how parse returns. Returns what parse
           returns.

       parse_start([ OPT => OPT_VALUE [...]])
           Create and return a new instance of XML::Parser::ExpatNB.
           Constructor options may be provided. If an Init handler has been
           provided, it is called before returning the ExpatNB object.
           Documents are parsed by making incremental calls to the parse_more
           method of this object, which takes a string. A single call to the
           parse_done method of this object, which takes no arguments,
           indicates that the document is finished.

           If there is a Final handler installed, it is executed by the
           parse_done method before returning, and the parse_done method
           returns whatever is returned by the Final handler.
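
           A sketch of incremental parsing follows. The chunk boundaries are
           arbitrary and chosen only to show that markup may be split between
           calls to parse_more.

```perl
use strict;
use warnings;
use XML::Parser;

my $text   = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char  => sub { $text .= $_[1] },
        Final => sub { return length $text },
    },
);

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;

# Feed the document in pieces, as it might arrive from a socket.
$nb->parse_more($_) for '<doc>Hel', 'lo, ', 'world</d', 'oc>';

# parse_done runs the Final handler and returns its value.
my $result = $nb->parse_done;

print "text=$text result=$result\n";
```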

HANDLERS
       Expat is an event based parser. As the parser recognizes parts of the
       document (say the start or end tag for an XML element), any handlers
       registered for that type of event are called with suitable parameters.
       All handlers receive an instance of XML::Parser::Expat as their first
       argument. See "METHODS" in XML::Parser::Expat for a discussion of the
       methods that can be called on this object.

   Init (Expat)
       This is called just before the parsing of the document starts.

   Final (Expat)
       This is called just after parsing has finished, but only if no errors
       occurred during the parse. Parse returns what this returns.

   Start (Expat, Element [, Attr, Val [,...]])
       This event is generated when an XML start tag is recognized. Element is
       the name of the XML element type that is opened with the start tag. The
       Attr & Val pairs are generated for each attribute in the start tag.

   End (Expat, Element)
       This event is generated when an XML end tag is recognized. Note that an
       XML empty tag (<foo/>) generates both a start and an end event.

   Char (Expat, String)
       This event is generated when non-markup is recognized. The non-markup
       sequence of characters is in String. A single non-markup sequence of
       characters may generate multiple calls to this handler. Whatever the
       encoding of the string in the original document, this is given to the
       handler in UTF-8.
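
       Since one run of text may arrive in several Char calls, a handler
       that needs the complete text of an element has to accumulate it. The
       following sketch does so for a hypothetical "title" element; the
       element names and variables are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $buffer = '';
my @titles;

my $parser = XML::Parser->new(
    Handlers => {
        # Reset the buffer when a title opens.
        Start => sub { $buffer = '' if $_[1] eq 'title' },
        # Append every piece of text; do not assume one call per run.
        Char  => sub { $buffer .= $_[1] },
        # The buffer is only complete once the end tag is seen.
        End   => sub { push @titles, $buffer if $_[1] eq 'title' },
    },
);

$parser->parse(
    '<list><title>Tom &amp; Jerry</title><title>Up</title></list>'
);

print "$_\n" for @titles;
```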

   Proc (Expat, Target, Data)
       This event is generated when a processing instruction is recognized.

   Comment (Expat, Data)
       This event is generated when a comment is recognized.

   CdataStart (Expat)
       This is called at the start of a CDATA section.

   CdataEnd (Expat)
       This is called at the end of a CDATA section.

   Default (Expat, String)
       This is called for any characters that don't have a registered handler.
       This includes both characters that are part of markup for which no
       events are generated (markup declarations) and characters that could
       generate events, but for which no handler has been registered.

       Whatever the encoding in the original document, the string is returned
       to the handler in UTF-8.

   Unparsed (Expat, Entity, Base, Sysid, Pubid, Notation)
       This is called for a declaration of an unparsed entity. Entity is the
       name of the entity. Base is the base to be used for resolving a
       relative URI. Sysid is the system id. Pubid is the public id. Notation
       is the notation name. Base and Pubid may be undefined.

   Notation (Expat, Notation, Base, Sysid, Pubid)
       This is called for a declaration of notation. Notation is the notation
       name. Base is the base to be used for resolving a relative URI. Sysid
       is the system id. Pubid is the public id. Base, Sysid, and Pubid may
       all be undefined.

   ExternEnt (Expat, Base, Sysid, Pubid)
       This is called when an external entity is referenced. Base is the base
       to be used for resolving a relative URI. Sysid is the system id. Pubid
       is the public id. Base and Pubid may be undefined.

       This handler should either return a string, which represents the
       contents of the external entity, or return an open filehandle that can
       be read to obtain the contents of the external entity, or return undef,
       which indicates the external entity couldn't be found and will generate
       a parse error.

       If an open filehandle is returned, it must be returned as either a glob
       (*FOO) or as a reference to a glob (e.g. an instance of IO::Handle).

       A default handler is installed for this event. The default handler is
       XML::Parser::lwp_ext_ent_handler, unless the NoLWP option was provided
       with a true value, in which case XML::Parser::file_ext_ent_handler is
       the default handler for external entities. Even without the NoLWP
       option, if the URI or LWP modules are missing, the file based handler
       ends up being used after giving a warning on the first external entity
       reference.

       The LWP external entity handler will use proxies defined in the
       environment (http_proxy, ftp_proxy, etc.).

       Please note that the LWP external entity handler reads the entire
       entity into a string and returns it, whereas the file handler opens a
       filehandle.

       Also note that the file external entity handler will likely choke on
       absolute URIs or file names that don't fit the conventions of the local
       operating system.

       The expat base method can be used to set a basename for relative
       pathnames. If no basename is given, or if the basename is itself a
       relative name, then it is relative to the current working directory.

   ExternEntFin (Expat)
       This is called after parsing an external entity. It's not called unless
       an ExternEnt handler is also set. There is a default handler installed
       that pairs with the default ExternEnt handler.

       If you're going to install your own ExternEnt handler, then you should
       set (or unset) this handler too.

   Entity (Expat, Name, Val, Sysid, Pubid, Ndata, IsParam)
       This is called when an entity is declared. For internal entities, the
       Val parameter will contain the value and the Sysid, Pubid, and Ndata
       parameters will be undefined. For external entities, the Val parameter
       will be undefined, the Sysid parameter will have the system id, the
       Pubid parameter will have the public id if it was provided (it will be
       undefined otherwise), and the Ndata parameter will contain the notation
       for unparsed entities. If this is a parameter entity declaration, then
       the IsParam parameter is true.

       Note that this handler and the Unparsed handler above overlap. If both
       are set, then this handler will not be called for unparsed entities.

   Element (Expat, Name, Model)
       The element handler is called when an element declaration is found.
       Name is the element name, and Model is the content model as an
       XML::Parser::ContentModel object. See "XML::Parser::ContentModel
       Methods" in XML::Parser::Expat for methods available for this class.

   Attlist (Expat, Elname, Attname, Type, Default, Fixed)
       This handler is called for each attribute in an ATTLIST declaration.
       So an ATTLIST declaration that has multiple attributes will generate
       multiple calls to this handler. The Elname parameter is the name of the
       element with which the attribute is being associated. The Attname
       parameter is the name of the attribute. Type is the attribute type,
       given as a string. Default is the default value, which will either be
       "#REQUIRED", "#IMPLIED" or a quoted string (i.e. the returned string
       will begin and end with a quote character). If Fixed is true, then
       this is a fixed attribute.
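
       To illustrate the one-call-per-attribute behavior, the sketch below
       records each Attlist call for a small, made-up internal subset. The
       document and variable names are assumptions for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @decls;
my $parser = XML::Parser->new(
    Handlers => {
        Attlist => sub {
            my ($expat, $elname, $attname, $type, $default, $fixed) = @_;
            # One record per attribute, even within a single ATTLIST.
            push @decls, "$elname/$attname $type $default";
        },
    },
);

$parser->parse(<<'XML');
<!DOCTYPE doc [
  <!ATTLIST doc id   ID    #REQUIRED
                lang CDATA "en">
]>
<doc id="d1"/>
XML

print "$_\n" for @decls;
```

       The single ATTLIST declaration here declares two attributes, so the
       handler is expected to run twice. The default for "lang" arrives with
       its quote characters included.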

   Doctype (Expat, Name, Sysid, Pubid, Internal)
       This handler is called for DOCTYPE declarations. Name is the document
       type name. Sysid is the system id of the document type, if it was
       provided, otherwise it's undefined. Pubid is the public id of the
       document type, which will be undefined if no public id was given.
       Internal is the internal subset, given as a string. If there was no
       internal subset, it will be undefined. Internal will contain all
       whitespace, comments, processing instructions, and declarations seen in
       the internal subset. The declarations will be there whether or not they
       have been processed by another handler (except for unparsed entities
       processed by the Unparsed handler). However, comments and processing
       instructions will not appear if they've been processed by their
       respective handlers.

   DoctypeFin (Expat)
       This handler is called after parsing of the DOCTYPE declaration has
       finished, including any internal or external DTD declarations.

   XMLDecl (Expat, Version, Encoding, Standalone)
       This handler is called for XML declarations. Version is a string
       containing the version. Encoding is either undefined or contains an
       encoding string. Standalone will be either true, false, or undefined if
       the standalone attribute is yes, no, or not given, respectively.

STYLES
   Debug
       This just prints out the document in outline form. Nothing special is
       returned by parse.

   Subs
       Each time an element starts, a sub by that name in the package
       specified by the Pkg option is called with the same parameters that the
       Start handler gets called with.

       Each time an element ends, a sub with that name appended with an
       underscore ("_"), is called with the same parameters that the End
       handler gets called with.

       Nothing special is returned by parse.
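
       A sketch of the Subs style follows. The package name MyHandlers and
       the element name "item" are hypothetical. No sub is defined for the
       "list" element, so it is simply not reported.

```perl
use strict;
use warnings;

package MyHandlers;    # hypothetical package holding the element subs

our @log;

# Called for each <item> start tag, with the Start handler's parameters.
sub item {
    my ($expat, $element, %attr) = @_;
    push @log, "open $element id=$attr{id}";
}

# Called for each </item> end tag: element name plus an underscore.
sub item_ {
    my ($expat, $element) = @_;
    push @log, "close $element";
}

package main;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'MyHandlers');
$parser->parse('<list><item id="1"/><item id="2"/></list>');

print "$_\n" for @MyHandlers::log;
```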

   Tree
       Parse will return a parse tree for the document. Each node in the tree
       takes the form of a tag, content pair. Text nodes are represented with
       a pseudo-tag of "0" and the string that is their content. For elements,
       the content is an array reference. The first item in the array is a
       (possibly empty) hash reference containing attributes. The remainder of
       the array is a sequence of tag-content pairs representing the content
       of the element.

       So for example the result of parsing:

         <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>

       would be:

                    Tag   Content
         ==================================================================
         [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
                     bar, [         {}, 0, "Howdy",  ref, [{}]],
                       0, "do"
               ]
         ]

       The root element "foo" has 3 children: a "head" element, a "bar"
       element and the text "do". After the empty attribute hash, these are
       represented in its contents by 3 tag-content pairs.
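
       One way to consume this structure is a small recursive walk over the
       tag-content pairs. The sketch below collects all text in document
       order; the helper name text_of is hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

my $tree = XML::Parser->new(Style => 'Tree')->parse(
    '<foo><head id="a">Hello <em>there</em></head>'
  . '<bar>Howdy<ref/></bar>do</foo>'
);

# Hypothetical helper: return the concatenated text of one
# (tag, content) pair.
sub text_of {
    my ($tag, $content) = @_;
    return $content if $tag eq '0';    # text node: content is the string
    my (undef, @kids) = @$content;     # skip the attribute hash
    my $text = '';
    # The rest of the array is a flat list of tag-content pairs.
    while (my ($kid_tag, $kid_content) = splice @kids, 0, 2) {
        $text .= text_of($kid_tag, $kid_content);
    }
    return $text;
}

print text_of(@$tree), "\n";
```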

   Objects
       This is similar to the Tree style, except that a hash object is created
       for each element. The corresponding object will be in the class whose
       name is created by appending "::" and the element name to the package
       set with the Pkg option. Non-markup text will be in the ::Characters
       class. The contents of the corresponding object will be in an anonymous
       array that is the value of the Kids property for that object.
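
       The sketch below inspects the objects built for a small document. The
       package name Doc is hypothetical. Two details are assumptions taken
       from the behavior of XML::Parser::Style::Objects rather than from the
       text above: parse returns an array reference holding the root object,
       and character objects keep their string under the Text key.

```perl
use strict;
use warnings;
use XML::Parser;

# 'Doc' is a hypothetical package prefix for the generated classes.
my $parser = XML::Parser->new(Style => 'Objects', Pkg => 'Doc');
my $result = $parser->parse('<note lang="en">Hi <b>there</b></note>');

# Assumption: the result is an array ref whose first item is the root.
my $root = $result->[0];
print ref($root), " lang=$root->{lang}\n";

for my $kid (@{ $root->{Kids} }) {
    if (ref($kid) eq 'Doc::Characters') {
        # Assumption: text is stored under the Text key.
        print "text: $kid->{Text}\n";
    }
    else {
        print "element: ", ref($kid), "\n";
    }
}
```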

   Stream
       This style also uses the Pkg package. If none of the subs that this
       style looks for is there, then the effect of parsing with this style is
       to print a canonical copy of the document without comments or
       declarations. All the subs receive as their first parameter the Expat
       instance for the document they're parsing.

       It looks for the following routines:

       * StartDocument
         Called at the start of the parse.

       * StartTag
         Called for every start tag with a second parameter of the element
         type. The $_ variable will contain a copy of the tag and the %_
         variable will contain attribute values supplied for that element.

       * EndTag
         Called for every end tag with a second parameter of the element
         type. The $_ variable will contain a copy of the end tag.

       * Text
         Called just before start or end tags with accumulated non-markup
         text in the $_ variable.

       * PI
         Called for processing instructions. The $_ variable will contain a
         copy of the PI and the target and data are sent as 2nd and 3rd
         parameters respectively.

       * EndDocument
         Called at conclusion of the parse.
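
       A sketch of the Stream style follows. The package name Outline is
       hypothetical, and only three of the routines listed above are
       defined, which is enough for the sample document.

```perl
use strict;
use warnings;

package Outline;    # hypothetical package holding the Stream routines

our @events;

# $_ holds a copy of the tag, %_ holds the attribute values.
sub StartTag {
    my ($expat, $type) = @_;
    push @events, "start $type class=$_{class} tag=$_";
}

sub EndTag {
    my ($expat, $type) = @_;
    push @events, "end $type";
}

# $_ holds the text accumulated since the previous tag.
sub Text {
    push @events, "text: $_";
}

package main;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'Outline');
$parser->parse('<p class="x">Hi</p>');

print "$_\n" for @Outline::events;
```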

ENCODINGS
       XML documents may be encoded in character sets other than Unicode as
       long as they may be mapped into the Unicode character set. Expat has
       further restrictions on encodings. Read the xmlparse.h header file in
       the expat distribution to see details on these restrictions.

       Expat has built-in encodings for: "UTF-8", "ISO-8859-1", "UTF-16", and
       "US-ASCII". Encodings are set either through the XML declaration
       encoding attribute or through the ProtocolEncoding option to
       XML::Parser or XML::Parser::Expat.

       For encodings other than the built-ins, expat calls the function
       load_encoding in the Expat package with the encoding name. This
       function looks for a file in the path list
       @XML::Parser::Expat::Encoding_Path that matches the lower-cased name
       with a '.enc' extension. The first one it finds, it loads.

       If you wish to build your own encoding maps, check out the
       XML::Encoding module from CPAN.

AUTHORS
       Larry Wall <larry@wall.org> wrote version 1.0.

       Clark Cooper <coopercc@netheaven.com> picked up support, changed the
       API for this version (2.x), provided documentation, and added some
       standard package features.

       Matt Sergeant <matt@sergeant.org> is now maintaining XML::Parser.

perl v5.8.8                       2003-08-18                        Parser(3)