Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME
XML::Parser - A Perl module for parsing XML documents

SYNOPSIS
    use XML::Parser;

    $p1 = XML::Parser->new(Style => 'Debug');
    $p1->parsefile('REC-xml-19980210.xml');
    $p1->parse('<foo id="me">Hello World</foo>');

    # Alternative
    $p2 = XML::Parser->new(Handlers => {Start => \&handle_start,
                                        End   => \&handle_end,
                                        Char  => \&handle_char});
    $p2->parse($socket);

    # Another alternative
    $p3 = XML::Parser->new(ErrorContext => 2);

    $p3->setHandlers(Char    => \&text,
                     Default => \&other);

    # Parse from a pipe; the same handle is opened, parsed, and closed
    open(my $fh, '-|', 'xmlgenerator') or die "Cannot run xmlgenerator: $!";
    $p3->parse($fh, ProtocolEncoding => 'ISO-8859-1');
    close($fh);

    $p3->parsefile('junk.xml', ErrorContext => 3);

DESCRIPTION
This module provides ways to parse XML documents. It is built on top of
XML::Parser::Expat, which is a lower level interface to James Clark's
expat library. Each call to one of the parsing methods creates a new
instance of XML::Parser::Expat which is then used to parse the
document. Expat options may be provided when the XML::Parser object is
created. These options are then passed on to the Expat object on each
parse call. They can also be given as extra arguments to the parse
methods, in which case they override options given at XML::Parser
creation time.

The behavior of the parser is controlled by the Style option, the
Handlers option, the setHandlers method, or a combination of these
(see "STYLES" and "HANDLERS"). All of them are mechanisms for
XML::Parser to set the handlers needed by XML::Parser::Expat. If
neither Style nor Handlers is specified, then parsing just checks
that the document is well-formed.

When underlying handlers get called, they receive as their first
parameter the Expat object, not the Parser object.

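As a minimal sketch of the two points above, the following checks documents for well-formedness with a parser created without Style or Handlers, and overrides a constructor option for a single call. The helper name is_well_formed is an illustrative assumption, not part of XML::Parser.

```perl
use strict;
use warnings;
use XML::Parser;

# No Style and no Handlers: parsing only checks well-formedness.
my $checker = XML::Parser->new(ErrorContext => 2);

# Hypothetical helper: returns 1 if $xml parses, 0 otherwise.
sub is_well_formed {
    my ($xml, %opts) = @_;
    # parse() dies on the first error, so trap it with eval.
    # Options passed here override those given to new() for this call.
    return eval { $checker->parse($xml, %opts); 1 } ? 1 : 0;
}

print is_well_formed('<a><b/></a>') ? "well-formed\n" : "not well-formed\n";
print is_well_formed('<a><b></a>', ErrorContext => 0)
    ? "well-formed\n" : "not well-formed\n";
```

Because each parse call builds a fresh XML::Parser::Expat instance, the same XML::Parser object can be reused for any number of documents.
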
METHODS
new
    This is a class method, the constructor for XML::Parser. Options
    are passed as keyword-value pairs. Recognized options are:

    · Style

      This option provides an easy way to create a given style of
      parser. The built-in styles are: "Debug", "Subs", "Tree",
      "Objects", and "Stream". These are all defined in separate
      packages under "XML::Parser::Style::*", and you can find
      further documentation for each style both below and in those
      packages.

      Custom styles can be provided by giving a full package name
      containing at least one '::'. This package should then have
      subs defined for each handler it wishes to have installed. See
      "STYLES" below for a discussion of each built-in style.

    · Handlers

      When provided, this option should be an anonymous hash
      containing as keys the type of handler and as values a sub
      reference to handle that type of event. All the handlers get
      passed as their first parameter the instance of Expat that is
      parsing the document. Further details on handlers can be found
      in "HANDLERS". Any handler set here overrides the corresponding
      handler set with the Style option.

    · Pkg

      Some styles will refer to subs defined in this package. If not
      provided, it defaults to the package which called the
      constructor.

    · ErrorContext

      This is an Expat option. When this option is defined, errors
      are reported in context. The value should be the number of
      lines to show on either side of the line in which the error
      occurred.

    · ProtocolEncoding

      This is an Expat option. This sets the protocol encoding name.
      It defaults to none. The built-in encodings are: "UTF-8",
      "ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be
      used if they have encoding maps in one of the directories in
      the @Encoding_Path list. Check "ENCODINGS" for more information
      on encoding maps. Setting the protocol encoding overrides any
      encoding in the XML declaration.

    · Namespaces

      This is an Expat option. If this is set to a true value, then
      namespace processing is done during the parse. See "Namespaces"
      in XML::Parser::Expat for further discussion of namespace
      processing.

    · NoExpand

      This is an Expat option. Normally, the parser will try to
      expand references to entities defined in the internal subset.
      If this option is set to a true value, and a default handler is
      also set, then the default handler will be called when an
      entity reference is seen in text. This has no effect if a
      default handler has not been registered, and it has no effect
      on the expansion of entity references inside attribute values.

    · Stream_Delimiter

      This is an Expat option. It takes a string value. When this
      string is found alone on a line while parsing from a stream,
      then the parse is ended as if it saw an end of file. The
      intended use is with a stream of XML documents in a MIME
      multipart format. The string should not contain a trailing
      newline.

    · ParseParamEnt

      This is an Expat option. Unless standalone is set to "yes" in
      the XML declaration, setting this to a true value allows the
      external DTD to be read, and parameter entities to be parsed
      and expanded.

    · NoLWP

      This option has no effect if the ExternEnt or ExternEntFin
      handlers are directly set. Otherwise, if true, it forces the
      use of a file-based external entity handler.

    · Non_Expat_Options

      If provided, this should be an anonymous hash whose keys are
      options that shouldn't be passed to Expat. This should only be
      of concern to those subclassing XML::Parser.

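The Handlers and ErrorContext options described above can be combined in one constructor call. The sketch below is illustrative only: the handler name on_start and the sample document are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;

# Each handler receives the Expat instance first, then event-specific
# arguments; for Start these are the element name and attribute pairs.
sub on_start {
    my ($expat, $element, %attr) = @_;
    push @seen, exists $attr{id} ? "$element#$attr{id}" : $element;
}

my $parser = XML::Parser->new(
    Handlers     => { Start => \&on_start },
    ErrorContext => 2,    # show 2 lines of context around any error
);

$parser->parse('<doc><sec id="intro"/><sec id="body"/></doc>');
print join(', ', @seen), "\n";
```
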
setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
    This method registers handlers for various parser events. It
    overrides any previous handlers registered through the Style or
    Handlers options or through earlier calls to setHandlers. By
    providing a false or undefined value as the handler, the existing
    handler can be unset.

    This method returns a list of type, handler pairs corresponding to
    the input. The handlers returned are the ones that were in effect
    prior to the call.

    See a description of the handler types in "HANDLERS".

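Since setHandlers returns the previous type, handler pairs, a handler can be swapped temporarily and then restored. A small sketch, with first_char and second_char as hypothetical handler names:

```perl
use strict;
use warnings;
use XML::Parser;

my @log;
sub first_char  { push @log, "first:$_[1]" }
sub second_char { push @log, "second:$_[1]" }

my $parser = XML::Parser->new(Handlers => { Char => \&first_char });

# setHandlers returns the (type, handler) pairs in effect before the call.
my @previous = $parser->setHandlers(Char => \&second_char);
$parser->parse('<a>x</a>');    # handled by second_char

# Passing the saved pairs back restores the earlier handler.
$parser->setHandlers(@previous);
$parser->parse('<a>y</a>');    # handled by first_char

print "$_\n" for @log;
```
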
parse(SOURCE [, OPT => OPT_VALUE [...]])
    The SOURCE parameter should either be a string containing the whole
    XML document, or it should be an open IO::Handle. Constructor
    options to XML::Parser::Expat given as keyword-value pairs may
    follow the SOURCE parameter. These override, for this call, any
    options or attributes passed through from the XML::Parser instance.

    If a parse error occurs, parse dies with an error message.
    Otherwise it returns 1, or whatever is returned from the Final
    handler if one is installed. In other words, what parse may return
    depends on the style.

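To illustrate the return value and the error behavior of parse, here is a short sketch that counts start tags and hands the count back through a Final handler. Keeping the counter in a closure variable is a choice made for this example.

```perl
use strict;
use warnings;
use XML::Parser;

my $count;
my $parser = XML::Parser->new(
    Handlers => {
        Init  => sub { $count = 0 },       # reset before each parse
        Start => sub { $count++ },         # one call per start tag
        Final => sub { return $count },    # becomes parse()'s return value
    },
);

my $elements = $parser->parse('<a><b/><c/></a>');
print "elements: $elements\n";

# A parse error is raised with die, so trap it with eval.
my $ok = eval { $parser->parse('<a><b></a>'); 1 };
print "second document rejected\n" unless $ok;
```
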
parsestring
    This is just an alias for parse for backwards compatibility.

parsefile(FILE [, OPT => OPT_VALUE [...]])
    Open FILE for reading, then call parse with the open handle. The
    file is closed no matter how parse returns. Returns what parse
    returns.

parse_start([ OPT => OPT_VALUE [...]])
    Create and return a new instance of XML::Parser::ExpatNB.
    Constructor options may be provided. If an Init handler has been
    provided, it is called before returning the ExpatNB object.
    Documents are parsed by making incremental calls to the parse_more
    method of this object, which takes a string. A single call to the
    parse_done method of this object, which takes no arguments,
    indicates that the document is finished.

    If there is a Final handler installed, it is executed by the
    parse_done method before returning, and the parse_done method
    returns whatever is returned by the Final handler.

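The incremental interface can be sketched as follows. The way the document is split into chunks is arbitrary and chosen only to show that a chunk may end in the middle of a tag.

```perl
use strict;
use warnings;
use XML::Parser;

my @names;
my $parser = XML::Parser->new(
    Handlers => { Start => sub { push @names, $_[1] } },
);

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;

# Feed the document in pieces, for example as data arrives on a socket.
$nb->parse_more('<doc><it');
$nb->parse_more('em/></doc>');

# Signal that the document is complete.
$nb->parse_done;

print join(', ', @names), "\n";
```
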
HANDLERS
Expat is an event-based parser. As the parser recognizes parts of the
document (say the start or end tag for an XML element), any handlers
registered for that type of event are called with suitable
parameters. All handlers receive an instance of XML::Parser::Expat as
their first argument. See "METHODS" in XML::Parser::Expat for a
discussion of the methods that can be called on this object.

Init (Expat)
    This is called just before the parsing of the document starts.

Final (Expat)
    This is called just after parsing has finished, but only if no
    errors occurred during the parse. Parse returns what this returns.

Start (Expat, Element [, Attr, Val [,...]])
    This event is generated when an XML start tag is recognized.
    Element is the name of the XML element type that is opened with the
    start tag. The Attr & Val pairs are generated for each attribute in
    the start tag.

End (Expat, Element)
    This event is generated when an XML end tag is recognized. Note
    that an XML empty tag (<foo/>) generates both a start and an end
    event.

Char (Expat, String)
    This event is generated when non-markup is recognized. The
    non-markup sequence of characters is in String. A single non-markup
    sequence of characters may generate multiple calls to this handler.
    Whatever the encoding of the string in the original document, this
    is given to the handler in UTF-8.

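Because one run of text may arrive in several Char calls, handlers usually accumulate text and act on it in the End handler. A minimal sketch, assuming a hypothetical document whose title elements are of interest:

```perl
use strict;
use warnings;
use XML::Parser;

my ($buffer, @titles);

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element, %attr) = @_;
            $buffer = '' if $element eq 'title';    # start collecting
        },
        Char => sub {
            my ($expat, $string) = @_;
            # May be called several times for a single run of text.
            $buffer .= $string if defined $buffer;
        },
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'title') {
                push @titles, $buffer;
                undef $buffer;                      # stop collecting
            }
        },
    },
);

$parser->parse('<list><title>Tom &amp; Jerry</title><title>Up</title></list>');
print "$_\n" for @titles;
```
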
Proc (Expat, Target, Data)
    This event is generated when a processing instruction is
    recognized.

Comment (Expat, Data)
    This event is generated when a comment is recognized.

CdataStart (Expat)
    This is called at the start of a CDATA section.

CdataEnd (Expat)
    This is called at the end of a CDATA section.

Default (Expat, String)
    This is called for any characters that don't have a registered
    handler. This includes both characters that are part of markup for
    which no events are generated (markup declarations) and characters
    that could generate events, but for which no handler has been
    registered.

    Whatever the encoding in the original document, the string is
    returned to the handler in UTF-8.

Unparsed (Expat, Entity, Base, Sysid, Pubid, Notation)
    This is called for a declaration of an unparsed entity. Entity is
    the name of the entity. Base is the base to be used for resolving a
    relative URI. Sysid is the system id. Pubid is the public id.
    Notation is the notation name. Base and Pubid may be undefined.

Notation (Expat, Notation, Base, Sysid, Pubid)
    This is called for a declaration of notation. Notation is the
    notation name. Base is the base to be used for resolving a relative
    URI. Sysid is the system id. Pubid is the public id. Base, Sysid,
    and Pubid may all be undefined.

ExternEnt (Expat, Base, Sysid, Pubid)
    This is called when an external entity is referenced. Base is the
    base to be used for resolving a relative URI. Sysid is the system
    id. Pubid is the public id. Base and Pubid may be undefined.

    This handler should either return a string, which represents the
    contents of the external entity, or return an open filehandle that
    can be read to obtain the contents of the external entity, or
    return undef, which indicates the external entity couldn't be found
    and will generate a parse error.

    If an open filehandle is returned, it must be returned as either a
    glob (*FOO) or as a reference to a glob (e.g. an instance of
    IO::Handle).

    A default handler is installed for this event. The default handler
    is XML::Parser::lwp_ext_ent_handler, unless the NoLWP option was
    provided with a true value, in which case
    XML::Parser::file_ext_ent_handler is the default handler for
    external entities. Even without the NoLWP option, if the URI or LWP
    modules are missing, the file-based handler ends up being used
    after giving a warning on the first external entity reference.

    The LWP external entity handler will use proxies defined in the
    environment (http_proxy, ftp_proxy, etc.).

    Please note that the LWP external entity handler reads the entire
    entity into a string and returns it, whereas the file handler opens
    a filehandle.

    Also note that the file external entity handler will likely choke
    on absolute URIs or file names that don't fit the conventions of
    the local operating system.

    The expat base method can be used to set a basename for relative
    pathnames. If no basename is given, or if the basename is itself a
    relative name, then it is relative to the current working
    directory.

ExternEntFin (Expat)
    This is called after parsing an external entity. It's not called
    unless an ExternEnt handler is also set. There is a default handler
    installed that pairs with the default ExternEnt handler.

    If you're going to install your own ExternEnt handler, then you
    should set (or unset) this handler too.

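A custom ExternEnt handler that returns the entity content as a string might look like the sketch below. The %entities table and the system id greeting.ent are assumptions made for the example; a real handler would typically resolve Sysid against Base and read a file or URL.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical in-memory "files", keyed by system id.
my %entities = ('greeting.ent' => 'Hello from an entity');

my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        # Return the entity content as a string, or undef if it is
        # unknown (undef makes the parser report an error).
        ExternEnt => sub {
            my ($expat, $base, $sysid, $pubid) = @_;
            return $entities{$sysid};
        },
        # No ExternEntFin handler is set: a string needs no cleanup.
        Char => sub { $text .= $_[1] },
    },
);

$parser->parse(<<'XML');
<!DOCTYPE doc [
  <!ENTITY greeting SYSTEM "greeting.ent">
]>
<doc>&greeting;</doc>
XML

print "$text\n";
```

Setting ExternEnt directly means the default LWP or file-based handler pair is not installed, so no network or file access takes place in this sketch.
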
Entity (Expat, Name, Val, Sysid, Pubid, Ndata, IsParam)
    This is called when an entity is declared. For internal entities,
    the Val parameter will contain the value and the Sysid, Pubid, and
    Ndata parameters will be undefined. For external entities, the Val
    parameter will be undefined, the Sysid parameter will have the
    system id, the Pubid parameter will have the public id if it was
    provided (it will be undefined otherwise), and the Ndata parameter
    will contain the notation for unparsed entities. If this is a
    parameter entity declaration, then the IsParam parameter is true.

    Note that this handler and the Unparsed handler above overlap. If
    both are set, then this handler will not be called for unparsed
    entities.

Element (Expat, Name, Model)
    The element handler is called when an element declaration is found.
    Name is the element name, and Model is the content model as an
    XML::Parser::ContentModel object. See "XML::Parser::ContentModel
    Methods" in XML::Parser::Expat for methods available for this
    class.

Attlist (Expat, Elname, Attname, Type, Default, Fixed)
    This handler is called for each attribute in an ATTLIST
    declaration. So an ATTLIST declaration that has multiple attributes
    will generate multiple calls to this handler. The Elname parameter
    is the name of the element with which the attribute is being
    associated. The Attname parameter is the name of the attribute.
    Type is the attribute type, given as a string. Default is the
    default value, which will either be "#REQUIRED", "#IMPLIED" or a
    quoted string (i.e. the returned string will begin and end with a
    quote character). If Fixed is true, then this is a fixed attribute.

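As a sketch of the declaration handlers, the following collects one entry per attribute declared in an ATTLIST. The DTD and the output format are illustrative assumptions.

```perl
use strict;
use warnings;
use XML::Parser;

my @decls;
my $parser = XML::Parser->new(
    Handlers => {
        # Called once per attribute, not once per ATTLIST declaration.
        Attlist => sub {
            my ($expat, $elname, $attname, $type, $default, $fixed) = @_;
            push @decls, "$elname/$attname ($type)";
        },
    },
);

$parser->parse(<<'XML');
<!DOCTYPE doc [
  <!ATTLIST doc id   ID    #REQUIRED
                lang CDATA "en">
]>
<doc id="d1"/>
XML

print "$_\n" for @decls;
```
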
Doctype (Expat, Name, Sysid, Pubid, Internal)
    This handler is called for DOCTYPE declarations. Name is the
    document type name. Sysid is the system id of the document type, if
    it was provided, otherwise it's undefined. Pubid is the public id
    of the document type, which will be undefined if no public id was
    given. Internal is the internal subset, given as a string. If there
    was no internal subset, it will be undefined. Internal will contain
    all whitespace, comments, processing instructions, and declarations
    seen in the internal subset. The declarations will be there whether
    or not they have been processed by another handler (except for
    unparsed entities processed by the Unparsed handler). However,
    comments and processing instructions will not appear if they've
    been processed by their respective handlers.

DoctypeFin (Expat)
    This handler is called after parsing of the DOCTYPE declaration has
    finished, including any internal or external DTD declarations.

XMLDecl (Expat, Version, Encoding, Standalone)
    This handler is called for XML declarations. Version is a string
    containing the version. Encoding is either undefined or contains an
    encoding string. Standalone will be true, false, or undefined if
    the standalone attribute is "yes", "no", or not given,
    respectively.

STYLES
Debug
    This just prints out the document in outline form. Nothing special
    is returned by parse.

Subs
    Each time an element starts, a sub by that name in the package
    specified by the Pkg option is called with the same parameters that
    the Start handler gets called with.

    Each time an element ends, a sub with that name appended with an
    underscore ("_"), is called with the same parameters that the End
    handler gets called with.

    Nothing special is returned by parse.

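A minimal sketch of the Subs style follows. The package name MySubs and the element name item are assumptions made for the example; elements without a matching sub, such as list here, are simply not handled.

```perl
use strict;
use warnings;
use XML::Parser;

package MySubs;

our @log;

# Called when an <item> element starts (same arguments as Start).
sub item  { my ($expat, $element, %attr) = @_; push @log, "start $element" }

# Called when an <item> element ends (same arguments as End).
sub item_ { my ($expat, $element) = @_; push @log, "end $element" }

package main;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'MySubs');
$parser->parse('<list><item/><item/></list>');

print "$_\n" for @MySubs::log;
```
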
Tree
    Parse will return a parse tree for the document. Each node in the
    tree takes the form of a tag, content pair. Text nodes are
    represented with a pseudo-tag of "0" and the string that is their
    content. For elements, the content is an array reference. The first
    item in the array is a (possibly empty) hash reference containing
    attributes. The remainder of the array is a sequence of tag-content
    pairs representing the content of the element.

    So for example the result of parsing:

      <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>

    would be:

                 Tag   Content
      ==================================================================
      [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
                  bar, [         {}, 0, "Howdy",  ref, [{}]],
                    0, "do"
            ]
      ]

    The root element "foo" has 3 children: a "head" element, a "bar"
    element and the text "do". After the empty attribute hash, these
    are represented in its contents by 3 tag-content pairs.

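The tag, content layout lends itself to a short recursive walk. The sketch below gathers all text from a tree; the function name text_of is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: concatenate all text below an element's content
# array, which is laid out as [ \%attributes, tag, content, tag, content, ... ].
sub text_of {
    my ($content) = @_;
    my ($attributes, @pairs) = @$content;
    my $text = '';
    while (my ($tag, $body) = splice(@pairs, 0, 2)) {
        # Pseudo-tag 0 marks a text node; anything else is an element.
        $text .= $tag eq '0' ? $body : text_of($body);
    }
    return $text;
}

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse(
    '<foo><head id="a">Hello <em>there</em></head> do</foo>');

# The top level is a single pair: root tag and root content.
my ($root_tag, $root_content) = @$tree;
print "$root_tag: ", text_of($root_content), "\n";
```
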
Objects
    This is similar to the Tree style, except that a hash object is
    created for each element. The corresponding object will be in the
    class whose name is created by appending "::" and the element name
    to the package set with the Pkg option. Non-markup text will be in
    the ::Characters class. The contents of the corresponding object
    will be in an anonymous array that is the value of the Kids
    property for that object.

Stream
    This style also uses the Pkg package. If none of the subs that this
    style looks for is there, then the effect of parsing with this
    style is to print a canonical copy of the document without comments
    or declarations. All the subs receive as their first parameter the
    Expat instance for the document they're parsing.

    It looks for the following routines:

    · StartDocument

      Called at the start of the parse.

    · StartTag

      Called for every start tag with a second parameter of the element
      type. The $_ variable will contain a copy of the tag and the %_
      variable will contain attribute values supplied for that element.

    · EndTag

      Called for every end tag with a second parameter of the element
      type. The $_ variable will contain a copy of the end tag.

    · Text

      Called just before start or end tags with accumulated non-markup
      text in the $_ variable.

    · PI

      Called for processing instructions. The $_ variable will contain
      a copy of the PI and the target and data are sent as second and
      third parameters respectively.

    · EndDocument

      Called at conclusion of the parse.

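A small sketch of the Stream style, assuming a hypothetical package named Collector that defines StartTag, EndTag, and Text so that nothing is printed by default for those events:

```perl
use strict;
use warnings;
use XML::Parser;

package Collector;

our @out;

sub StartTag { push @out, $_ }       # $_ holds a copy of the start tag
sub EndTag   { push @out, $_ }       # $_ holds a copy of the end tag
sub Text     { push @out, uc $_ }    # $_ holds the accumulated text

package main;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'Collector');
$parser->parse('<a x="1">hi<b/></a>');

print join('', @Collector::out), "\n";
```

Events with no matching sub fall back to printing, so a document containing processing instructions would also need a PI sub to keep them off standard output.
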
ENCODINGS
XML documents may be encoded in character sets other than Unicode as
long as they may be mapped into the Unicode character set. Expat has
further restrictions on encodings. Read the xmlparse.h header file in
the expat distribution to see details on these restrictions.

Expat has built-in encodings for: "UTF-8", "ISO-8859-1", "UTF-16", and
"US-ASCII". Encodings are set either through the XML declaration
encoding attribute or through the ProtocolEncoding option to
XML::Parser or XML::Parser::Expat.

For encodings other than the built-ins, expat calls the function
load_encoding in the Expat package with the encoding name. This
function looks for a file in the path list
@XML::Parser::Expat::Encoding_Path, that matches the lower-cased name
with a '.enc' extension. The first one it finds, it loads.

If you wish to build your own encoding maps, check out the
XML::Encoding module from CPAN.

AUTHORS
Larry Wall <larry@wall.org> wrote version 1.0.

Clark Cooper <coopercc@netheaven.com> picked up support, changed the
API for this version (2.x), provided documentation, and added some
standard package features.

Matt Sergeant <matt@sergeant.org> is now maintaining XML::Parser.

perl v5.30.1                      2020-01-30                        Parser(3)