Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME

       XML::Parser - A Perl module for parsing XML documents

SYNOPSIS

         use XML::Parser;

         my $p1 = XML::Parser->new(Style => 'Debug');
         $p1->parsefile('REC-xml-19980210.xml');
         $p1->parse('<foo id="me">Hello World</foo>');

         # Alternative
         my $p2 = XML::Parser->new(Handlers => {Start => \&handle_start,
                                                End   => \&handle_end,
                                                Char  => \&handle_char});
         $p2->parse($socket);

         # Another alternative
         my $p3 = XML::Parser->new(ErrorContext => 2);

         $p3->setHandlers(Char    => \&text,
                          Default => \&other);

         open(FOO, 'xmlgenerator |');
         $p3->parse(*FOO, ProtocolEncoding => 'ISO-8859-1');
         close(FOO);

         $p3->parsefile('junk.xml', ErrorContext => 3);

DESCRIPTION

       This module provides ways to parse XML documents. It is built on top of
       XML::Parser::Expat, which is a lower level interface to James Clark's
       expat library. Each call to one of the parsing methods creates a new
       instance of XML::Parser::Expat which is then used to parse the
       document. Expat options may be provided when the XML::Parser object is
       created. These options are then passed on to the Expat object on each
       parse call. They can also be given as extra arguments to the parse
       methods, in which case they override options given at XML::Parser
       creation time.

       The behavior of the parser is controlled either by the "Style" and/or
       "Handlers" options, or by the "setHandlers" method. These all provide
       mechanisms for XML::Parser to set the handlers needed by
       XML::Parser::Expat. If neither "Style" nor "Handlers" is specified,
       then parsing just checks that the document is well-formed.

       When underlying handlers get called, they receive as their first
       parameter the Expat object, not the Parser object.

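       As a minimal sketch of the last point, the following records the class
       of the object a Start handler receives. The handler name on_start and
       the @seen array are made up for this illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;

# Start handler: the first argument is the XML::Parser::Expat instance
# doing the parse, not the XML::Parser object that created it.
sub on_start {
    my ($expat, $element, %attr) = @_;
    push @seen, ref($expat) . " saw <$element>";
}

my $parser = XML::Parser->new(Handlers => { Start => \&on_start });
$parser->parse('<foo id="me">Hello World</foo>');

print "$_\n" for @seen;    # prints "XML::Parser::Expat saw <foo>"
```

       Methods documented in XML::Parser::Expat are therefore the ones
       available on that first argument inside a handler.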

METHODS

       new This is a class method, the constructor for XML::Parser. Options
           are passed as keyword value pairs. Recognized options are:

           * Style
               This option provides an easy way to create a given style of
               parser. The built in styles are: "Debug", "Subs", "Tree",
               "Objects", and "Stream". These are all defined in separate
               packages under "XML::Parser::Style::*", and you can find
               further documentation for each style both below, and in those
               packages.

               Custom styles can be provided by giving a full package name
               containing at least one '::'. This package should then have
               subs defined for each handler it wishes to have installed. See
               "STYLES" below for a discussion of each built in style.

           * Handlers
               When provided, this option should be an anonymous hash
               containing as keys the type of handler and as values a sub
               reference to handle that type of event. All the handlers get
               passed as their 1st parameter the instance of expat that is
               parsing the document. Further details on handlers can be found
               in "HANDLERS". Any handler set here overrides the corresponding
               handler set with the Style option.

           * Pkg
               Some styles will refer to subs defined in this package. If not
               provided, it defaults to the package which called the
               constructor.

           * ErrorContext
               This is an Expat option. When this option is defined, errors
               are reported in context. The value should be the number of
               lines to show on either side of the line in which the error
               occurred.

           * ProtocolEncoding
               This is an Expat option. This sets the protocol encoding name.
               It defaults to none. The built-in encodings are: "UTF-8",
               "ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be
               used if they have encoding maps in one of the directories in
               the @Encoding_Path list. Check "ENCODINGS" for more information
               on encoding maps. Setting the protocol encoding overrides any
               encoding in the XML declaration.

           * Namespaces
               This is an Expat option. If this is set to a true value, then
               namespace processing is done during the parse. See "Namespaces"
               in XML::Parser::Expat for further discussion of namespace
               processing.

           * NoExpand
               This is an Expat option. Normally, the parser will try to
               expand references to entities defined in the internal subset.
               If this option is set to a true value, and a default handler is
               also set, then the default handler will be called when an
               entity reference is seen in text. This has no effect if a
               default handler has not been registered, and it has no effect
               on the expansion of entity references inside attribute values.

           * Stream_Delimiter
               This is an Expat option. It takes a string value. When this
               string is found alone on a line while parsing from a stream,
               then the parse is ended as if it saw an end of file. The
               intended use is with a stream of XML documents in a MIME
               multipart format. The string should not contain a trailing
               newline.

           * ParseParamEnt
               This is an Expat option. Unless standalone is set to "yes" in
               the XML declaration, setting this to a true value allows the
               external DTD to be read, and parameter entities to be parsed
               and expanded.

           * NoLWP
               This option has no effect if the ExternEnt or ExternEntFin
               handlers are directly set. Otherwise, if true, it forces the
               use of a file based external entity handler.

           * Non_Expat_Options
               If provided, this should be an anonymous hash whose keys are
               options that shouldn't be passed to Expat. This should only be
               of concern to those subclassing XML::Parser.

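           A sketch of how several of these options might be combined in one
           constructor call. The sample document and the @comments array are
           invented for the illustration, and it assumes the Tree style
           installs no Comment handler of its own.

```perl
use strict;
use warnings;
use XML::Parser;

my @comments;

my $parser = XML::Parser->new(
    Style        => 'Tree',    # installs the Tree style's handlers
    ErrorContext => 2,         # Expat option, passed on at each parse
    Handlers     => {
        # Explicit handlers are installed alongside the style's handlers.
        Comment => sub { my ($expat, $data) = @_; push @comments, $data },
    },
);

my $tree = $parser->parse('<doc><!-- note --><p>text</p></doc>');

print "root element: $tree->[0]\n";
print "comments seen: ", scalar(@comments), "\n";
```

           Had the Handlers hash named a handler type that the style also
           provides (Start, for example), the explicit handler would have
           taken precedence, as described under Handlers above.
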
       setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
           This method registers handlers for various parser events. It
           overrides any previous handlers registered through the Style or
           Handlers options or through earlier calls to setHandlers. By
           providing a false or undefined value as the handler, the existing
           handler can be unset.

           This method returns a list of type, handler pairs corresponding to
           the input. The handlers returned are the ones that were in effect
           prior to the call.

           See a description of the handler types in "HANDLERS".

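           Because the return value has the same shape as the arguments, it
           can be passed straight back to restore the earlier handlers. A
           small sketch, with the handler names count_chars and collect_text
           made up for the example:

```perl
use strict;
use warnings;
use XML::Parser;

my ($length, $text) = (0, '');

sub count_chars  { my ($expat, $string) = @_; $length += length $string }
sub collect_text { my ($expat, $string) = @_; $text   .= $string }

my $parser = XML::Parser->new(Handlers => { Char => \&count_chars });

# Swap in a different Char handler and remember the one it replaced.
my @previous = $parser->setHandlers(Char => \&collect_text);
$parser->parse('<a>abc</a>');     # collect_text runs

# Put the original handler back.
$parser->setHandlers(@previous);
$parser->parse('<a>defg</a>');    # count_chars runs

print "text=$text length=$length\n";    # text=abc length=4
```
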
       parse(SOURCE [, OPT => OPT_VALUE [...]])
           The SOURCE parameter should be either a string containing the whole
           XML document or an open IO::Handle. Constructor options to
           XML::Parser::Expat given as keyword-value pairs may follow the
           SOURCE parameter. These override, for this call, any options or
           attributes passed through from the XML::Parser instance.

           This method dies if a parse error occurs. Otherwise it returns 1,
           or whatever is returned from the Final handler if one is installed.
           In other words, what parse returns depends on the style.

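           Since a parse error is raised with die, callers that want to keep
           running usually wrap the call in eval. A minimal sketch (the helper
           name check is hypothetical, and the exact wording of the error
           message comes from expat and is not reproduced here):

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(ErrorContext => 1);

# Returns a short status string instead of letting the exception escape.
sub check {
    my ($xml) = @_;
    my $result = eval { $parser->parse($xml) };
    return defined $result ? "ok ($result)" : "error: $@";
}

print check('<a><b/></a>'), "\n";    # no Final handler, so parse returns 1
print check('<a><b></a>'),  "\n";    # mismatched tags make parse die
```
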
       parsestring
           This is just an alias for parse for backwards compatibility.

       parsefile(FILE [, OPT => OPT_VALUE [...]])
           Open FILE for reading, then call parse with the open handle. The
           file is closed no matter how parse returns. Returns what parse
           returns.

       parse_start([ OPT => OPT_VALUE [...]])
           Create and return a new instance of XML::Parser::ExpatNB.
           Constructor options may be provided. If an Init handler has been
           provided, it is called before returning the ExpatNB object.
           Documents are parsed by making incremental calls to the parse_more
           method of this object, which takes a string. A single call to the
           parse_done method of this object, which takes no arguments,
           indicates that the document is finished.

           If there is a Final handler installed, it is executed by the
           parse_done method before returning and the parse_done method
           returns whatever is returned by the Final handler.

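           The incremental interface can be sketched as follows. The chunk
           boundaries are arbitrary and chosen only to show that they need not
           line up with the markup; in practice the strings might come from a
           socket or a pipe.

```perl
use strict;
use warnings;
use XML::Parser;

my $text = '';

my $parser = XML::Parser->new(Handlers => {
    Char  => sub { my ($expat, $string) = @_; $text .= $string },
    Final => sub { return length $text },    # becomes parse_done's result
});

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;

# Feed the document in pieces.
$nb->parse_more($_) for '<doc>Hel', 'lo, wor', 'ld</doc>';

my $result = $nb->parse_done;

print "text=$text result=$result\n";
```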

HANDLERS

       Expat is an event based parser. As the parser recognizes parts of the
       document (say the start or end tag for an XML element), any handlers
       registered for that type of event are called with suitable parameters.
       All handlers receive an instance of XML::Parser::Expat as their first
       argument. See "METHODS" in XML::Parser::Expat for a discussion of the
       methods that can be called on this object.

       Init (Expat)

       This is called just before the parsing of the document starts.

       Final (Expat)

       This is called just after parsing has finished, but only if no errors
       occurred during the parse. The parse method returns what this returns.

       Start (Expat, Element [, Attr, Val [,...]])

       This event is generated when an XML start tag is recognized. Element is
       the name of the XML element type that is opened with the start tag. The
       Attr & Val pairs are generated for each attribute in the start tag.

       End (Expat, Element)

       This event is generated when an XML end tag is recognized. Note that an
       XML empty tag (<foo/>) generates both a start and an end event.

       Char (Expat, String)

       This event is generated when non-markup is recognized. The non-markup
       sequence of characters is in String. A single non-markup sequence of
       characters may generate multiple calls to this handler. Whatever the
       encoding of the string in the original document, this is given to the
       handler in UTF-8.

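       The following sketch puts Start, End, and Char together. Because one
       run of text may arrive in several Char calls, the handler appends
       rather than assigns. The variable names and the sample document are
       made up for the illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my (@open, %text, @events);

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $element, %attr) = @_;
        push @open,   $element;
        push @events, "start $element";
        $text{$element} = '';
    },
    End => sub {
        my ($expat, $element) = @_;
        pop @open;
        push @events, "end $element";
    },
    Char => sub {
        my ($expat, $string) = @_;
        # May be called several times for one run of text, so append.
        $text{ $open[-1] } .= $string;
    },
});

$parser->parse('<doc><title>Tom &amp; Jerry</title><empty/></doc>');

print "title: $text{title}\n";    # title: Tom & Jerry
print join(', ', @events), "\n";
```

       The empty tag <empty/> contributes both a "start empty" and an "end
       empty" entry to @events, as noted under End above.
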
       Proc (Expat, Target, Data)

       This event is generated when a processing instruction is recognized.

       Comment (Expat, Data)

       This event is generated when a comment is recognized.

       CdataStart (Expat)

       This is called at the start of a CDATA section.

       CdataEnd (Expat)

       This is called at the end of a CDATA section.

       Default (Expat, String)

       This is called for any characters that don't have a registered handler.
       This includes both characters that are part of markup for which no
       events are generated (markup declarations) and characters that could
       generate events, but for which no handler has been registered.

       Whatever the encoding in the original document, the string is returned
       to the handler in UTF-8.

       Unparsed (Expat, Entity, Base, Sysid, Pubid, Notation)

       This is called for a declaration of an unparsed entity. Entity is the
       name of the entity. Base is the base to be used for resolving a
       relative URI. Sysid is the system id. Pubid is the public id. Notation
       is the notation name. Base and Pubid may be undefined.

       Notation (Expat, Notation, Base, Sysid, Pubid)

       This is called for a declaration of notation. Notation is the notation
       name. Base is the base to be used for resolving a relative URI. Sysid
       is the system id. Pubid is the public id. Base, Sysid, and Pubid may
       all be undefined.

       ExternEnt (Expat, Base, Sysid, Pubid)

       This is called when an external entity is referenced. Base is the base
       to be used for resolving a relative URI. Sysid is the system id. Pubid
       is the public id. Base and Pubid may be undefined.

       This handler should return one of three things: a string, which
       represents the contents of the external entity; an open filehandle
       that can be read to obtain the contents of the external entity; or
       undef, which indicates the external entity couldn't be found and will
       generate a parse error.

       If an open filehandle is returned, it must be returned as either a glob
       (*FOO) or as a reference to a glob (e.g. an instance of IO::Handle).

       A default handler is installed for this event. It is
       XML::Parser::lwp_ext_ent_handler unless the NoLWP option was provided
       with a true value, in which case XML::Parser::file_ext_ent_handler is
       used instead. Even without the NoLWP option, if the URI or LWP modules
       are missing, the file based handler ends up being used after giving a
       warning on the first external entity reference.

       The LWP external entity handler will use proxies defined in the
       environment (http_proxy, ftp_proxy, etc.).

       Please note that the LWP external entity handler reads the entire
       entity into a string and returns it, whereas the file handler opens a
       filehandle.

       Also note that the file external entity handler will likely choke on
       absolute URIs or file names that don't fit the conventions of the local
       operating system.

       The expat base method can be used to set a basename for relative
       pathnames. If no basename is given, or if the basename is itself a
       relative name, then it is relative to the current working directory.

       ExternEntFin (Expat)

       This is called after parsing an external entity. It's not called unless
       an ExternEnt handler is also set. There is a default handler installed
       that pairs with the default ExternEnt handler.

       If you're going to install your own ExternEnt handler, then you should
       set (or unset) this handler too.

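       As a sketch of a custom ExternEnt handler that returns a string, the
       example below resolves system ids from an in-memory table instead of
       the file system or the network. The %catalog hash, its contents, and
       the sample document are all hypothetical.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical in-memory "catalog" keyed by system id.
my %catalog = ('greeting.txt' => 'Hello from an entity');

my $text = '';

my $parser = XML::Parser->new(Handlers => {
    Char      => sub { my ($expat, $string) = @_; $text .= $string },
    ExternEnt => sub {
        my ($expat, $base, $sysid, $pubid) = @_;
        # Returning undef for an unknown id results in a parse error.
        return $catalog{$sysid};
    },
    # Set explicitly because ExternEnt is set; nothing to clean up here.
    ExternEntFin => sub { },
});

$parser->parse(<<'XML');
<!DOCTYPE doc [
  <!ENTITY greeting SYSTEM "greeting.txt">
]>
<doc>&greeting;</doc>
XML

print "$text\n";
```

       A handler that returns a filehandle instead would typically use
       ExternEntFin to close it once the entity has been parsed.
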
       Entity (Expat, Name, Val, Sysid, Pubid, Ndata, IsParam)

       This is called when an entity is declared. For internal entities, the
       Val parameter will contain the value and the Sysid, Pubid, and Ndata
       parameters will be undefined. For external entities, the Val parameter
       will be undefined, the Sysid parameter will have the system id, the
       Pubid parameter will have the public id if it was provided (it will be
       undefined otherwise), and the Ndata parameter will contain the notation
       for unparsed entities. If this is a parameter entity declaration, then
       the IsParam parameter is true.

       Note that this handler and the Unparsed handler above overlap. If both
       are set, then this handler will not be called for unparsed entities.

       Element (Expat, Name, Model)

       The element handler is called when an element declaration is found.
       Name is the element name, and Model is the content model as an
       XML::Parser::ContentModel object. See "XML::Parser::ContentModel
       Methods" in XML::Parser::Expat for methods available for this class.

       Attlist (Expat, Elname, Attname, Type, Default, Fixed)

       This handler is called for each attribute in an ATTLIST declaration.
       So an ATTLIST declaration that has multiple attributes will generate
       multiple calls to this handler. The Elname parameter is the name of the
       element with which the attribute is being associated. The Attname
       parameter is the name of the attribute. Type is the attribute type,
       given as a string. Default is the default value, which will either be
       "#REQUIRED", "#IMPLIED" or a quoted string (i.e. the returned string
       will begin and end with a quote character). If Fixed is true, then
       this is a fixed attribute.

       Doctype (Expat, Name, Sysid, Pubid, Internal)

       This handler is called for DOCTYPE declarations. Name is the document
       type name. Sysid is the system id of the document type, if it was
       provided, otherwise it's undefined. Pubid is the public id of the
       document type, which will be undefined if no public id was given.
       Internal is the internal subset, given as a string. If there was no
       internal subset, it will be undefined. Internal will contain all
       whitespace, comments, processing instructions, and declarations seen in
       the internal subset. The declarations will be there whether or not they
       have been processed by another handler (except for unparsed entities
       processed by the Unparsed handler). However, comments and processing
       instructions will not appear if they've been processed by their
       respective handlers.

       DoctypeFin (Expat)

       This handler is called after parsing of the DOCTYPE declaration has
       finished, including any internal or external DTD declarations.

       XMLDecl (Expat, Version, Encoding, Standalone)

       This handler is called for XML declarations. Version is a string
       containing the version. Encoding is either undefined or contains an
       encoding string. Standalone will be true, false, or undefined according
       to whether the standalone attribute is "yes", "no", or absent.

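       A short sketch of two of the declaration handlers, XMLDecl and Attlist,
       collecting what they are passed. The sample document and the variable
       names are invented. The exact quote character wrapped around a default
       value is left unspecified here, as it is in the text above.

```perl
use strict;
use warnings;
use XML::Parser;

my (%decl, @attlist);

my $parser = XML::Parser->new(Handlers => {
    XMLDecl => sub {
        my ($expat, $version, $encoding, $standalone) = @_;
        %decl = (
            version    => $version,
            encoding   => $encoding,
            standalone => $standalone,    # undef when not given
        );
    },
    Attlist => sub {
        my ($expat, $elname, $attname, $type, $default, $fixed) = @_;
        # One call per attribute, even within a single ATTLIST declaration.
        push @attlist, "$elname/$attname $type $default";
    },
});

$parser->parse(<<'XML');
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doc [
  <!ATTLIST doc id   ID    #REQUIRED
                lang CDATA "en">
]>
<doc id="d1"/>
XML

print "version $decl{version}, encoding $decl{encoding}\n";
print "$_\n" for @attlist;
```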

STYLES

       Debug

       This just prints out the document in outline form. Nothing special is
       returned by parse.

       Subs

       Each time an element starts, a sub by that name in the package
       specified by the Pkg option is called with the same parameters that the
       Start handler gets called with.

       Each time an element ends, a sub with that name appended with an
       underscore ("_"), is called with the same parameters that the End
       handler gets called with.

       Nothing special is returned by parse.

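       A minimal sketch of the Subs style. The package name MySubs and the
       @log array are hypothetical. In this sketch no subs are defined for the
       doc element, and such elements appear to be skipped silently.

```perl
use strict;
use warnings;
use XML::Parser;

package MySubs;    # hypothetical package name, passed via Pkg below

our @log;

# Called when <title> starts; same arguments as a Start handler.
sub title {
    my ($expat, $element, %attr) = @_;
    push @log, "start $element";
}

# Called when </title> is reached; note the trailing underscore.
sub title_ {
    my ($expat, $element) = @_;
    push @log, "end $element";
}

package main;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'MySubs');
$parser->parse('<doc><title>Hi</title></doc>');

print join(', ', @MySubs::log), "\n";    # start title, end title
```
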
       Tree

       Parse will return a parse tree for the document. Each node in the tree
       takes the form of a tag, content pair. Text nodes are represented with
       a pseudo-tag of "0" and the string that is their content. For elements,
       the content is an array reference. The first item in the array is a
       (possibly empty) hash reference containing attributes. The remainder of
       the array is a sequence of tag-content pairs representing the content
       of the element.

       So for example the result of parsing:

         <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>

       would be:

                    Tag   Content
         ==================================================================
         [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
                     bar, [         {}, 0, "Howdy",  ref, [{}]],
                       0, "do"
               ]
         ]

       The root element "foo" has 3 children: a "head" element, a "bar"
       element and the text "do". After the empty attribute hash, these are
       represented in its contents by 3 tag-content pairs.

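       One way to consume such a tree is to step over the content array two
       items at a time after skipping the attribute hash. The helper text_of
       below is a hypothetical name; it is only a sketch that concatenates
       all text found under a node.

```perl
use strict;
use warnings;
use XML::Parser;

# Concatenate all text below one element's content array,
# which has the layout [ \%attributes, tag, content, tag, content, ... ].
sub text_of {
    my ($content) = @_;
    my $text  = '';
    my @pairs = @{$content}[ 1 .. $#$content ];    # skip the attribute hash
    while (my ($tag, $value) = splice(@pairs, 0, 2)) {
        # Pseudo-tag 0 marks a text node; anything else is an element.
        $text .= $tag eq '0' ? $value : text_of($value);
    }
    return $text;
}

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse(
    '<foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>'
);

my ($root, $content) = @$tree;
print "$root: ", text_of($content), "\n";    # foo: Hello thereHowdydo
print "head id: $content->[2][0]{id}\n";     # head id: a
```
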
       Objects

       This is similar to the Tree style, except that a hash object is created
       for each element. The corresponding object will be in the class whose
       name is created by appending "::" and the element name to the package
       set with the Pkg option. Non-markup text will be in the ::Characters
       class. The contents of the corresponding object will be in an anonymous
       array that is the value of the Kids property for that object.

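       A sketch of inspecting the result of the Objects style. The package
       name MyDoc is hypothetical. The example assumes that parse returns an
       array reference whose first item is the root element's object and that
       attributes are stored as keys of that object; both details come from
       the style's implementation rather than from the text above.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Objects', Pkg => 'MyDoc');
my $tree   = $parser->parse('<doc id="x">Hi <b>there</b></doc>');

my $doc = $tree->[0];                # object for the root element
print ref($doc), "\n";               # MyDoc::doc
print "id=$doc->{id}\n";             # attributes are stored as hash keys

for my $kid (@{ $doc->{Kids} }) {
    print "  ", ref($kid), "\n";     # MyDoc::Characters, then MyDoc::b
}
```
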
       Stream

       This style also uses the Pkg package. If none of the subs that this
       style looks for is there, then the effect of parsing with this style is
       to print a canonical copy of the document without comments or
       declarations. All the subs receive as their 1st parameter the Expat
       instance for the document they're parsing.

       It looks for the following routines:

       * StartDocument
           Called at the start of the parse.

       * StartTag
           Called for every start tag with a second parameter of the element
           type. The $_ variable will contain a copy of the tag and the %_
           variable will contain attribute values supplied for that element.

       * EndTag
           Called for every end tag with a second parameter of the element
           type. The $_ variable will contain a copy of the end tag.

       * Text
           Called just before start or end tags with accumulated non-markup
           text in the $_ variable.

       * PI
           Called for processing instructions. The $_ variable will contain a
           copy of the PI and the target and data are sent as 2nd and 3rd
           parameters respectively.

       * EndDocument
           Called at conclusion of the parse.

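       A minimal sketch of the Stream style that records events instead of
       printing them. The package name MyStream and the @events array are
       hypothetical. StartTag, EndTag, and Text are all defined here so that
       nothing falls through to the style's default printing.

```perl
use strict;
use warnings;
use XML::Parser;

package MyStream;    # hypothetical package name, passed via Pkg below

our @events;

# $_ holds the tag or text being processed when each sub is called.
sub StartTag { my ($expat, $type) = @_; push @events, "start $type: $_" }
sub EndTag   { my ($expat, $type) = @_; push @events, "end $type: $_" }
sub Text     { push @events, "text: $_" }

package main;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'MyStream');
$parser->parse('<p class="x">Hi</p>');

print "$_\n" for @MyStream::events;
```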

ENCODINGS

       XML documents may be encoded in character sets other than Unicode as
       long as they may be mapped into the Unicode character set. Expat has
       further restrictions on encodings. Read the xmlparse.h header file in
       the expat distribution to see details on these restrictions.

       Expat has built-in encodings for: "UTF-8", "ISO-8859-1", "UTF-16", and
       "US-ASCII". Encodings are set either through the XML declaration
       encoding attribute or through the ProtocolEncoding option to
       XML::Parser or XML::Parser::Expat.

       For encodings other than the built-ins, expat calls the function
       load_encoding in the Expat package with the encoding name. This
       function looks for a file in the path list
       @XML::Parser::Expat::Encoding_Path, that matches the lower-cased name
       with a '.enc' extension. The first one it finds, it loads.

       If you wish to build your own encoding maps, check out the
       XML::Encoding module from CPAN.

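       As a small sketch of the ProtocolEncoding option with one of the
       built-in encodings, the document below is supplied as ISO-8859-1 bytes
       with no XML declaration. The sample text is invented, and the sketch
       assumes the string handed to the Char handler is a Perl character
       string decoded from the UTF-8 that expat produces.

```perl
use strict;
use warnings;
use XML::Parser;

my $text = '';

my $parser = XML::Parser->new(Handlers => {
    Char => sub { my ($expat, $string) = @_; $text .= $string },
});

# "\xE9" is e-acute as a single ISO-8859-1 byte.
$parser->parse("<p>caf\xE9</p>", ProtocolEncoding => 'ISO-8859-1');

printf "%d characters, last is U+%04X\n",
    length($text), ord(substr($text, -1));
```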

AUTHORS

       Larry Wall <larry@wall.org> wrote version 1.0.

       Clark Cooper <coopercc@netheaven.com> picked up support, changed the
       API for this version (2.x), provided documentation, and added some
       standard package features.

       Matt Sergeant <matt@sergeant.org> is now maintaining XML::Parser.


perl v5.8.8                       2003-08-18                         Parser(3)