Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME

       XML::Parser - A Perl module for parsing XML documents

SYNOPSIS

         use XML::Parser;

         $p1 = XML::Parser->new(Style => 'Debug');
         $p1->parsefile('REC-xml-19980210.xml');
         $p1->parse('<foo id="me">Hello World</foo>');

         # Alternative
         $p2 = XML::Parser->new(Handlers => {Start => \&handle_start,
                                             End   => \&handle_end,
                                             Char  => \&handle_char});
         $p2->parse($socket);

         # Another alternative
         $p3 = XML::Parser->new(ErrorContext => 2);

         $p3->setHandlers(Char    => \&text,
                          Default => \&other);

         # Parse the output of a command read through a pipe
         open(my $fh, 'xmlgenerator |') or die "Cannot run xmlgenerator: $!";
         $p3->parse($fh, ProtocolEncoding => 'ISO-8859-1');
         close($fh);

         $p3->parsefile('junk.xml', ErrorContext => 3);

DESCRIPTION

       This module provides ways to parse XML documents. It is built on top of
       XML::Parser::Expat, which is a lower level interface to James Clark's
       expat library. Each call to one of the parsing methods creates a new
       instance of XML::Parser::Expat which is then used to parse the
       document.  Expat options may be provided when the XML::Parser object is
       created.  These options are then passed on to the Expat object on each
       parse call.  They can also be given as extra arguments to the parse
       methods, in which case they override options given at XML::Parser
       creation time.

       The behavior of the parser is controlled either by the "Style" and/or
       "Handlers" options (see "STYLES" and "HANDLERS"), or by the
       "setHandlers" method. These all provide mechanisms for XML::Parser to
       set the handlers needed by XML::Parser::Expat.  If neither "Style" nor
       "Handlers" is specified, then parsing just checks the document for
       being well-formed.

       When underlying handlers get called, they receive as their first
       parameter the Expat object, not the Parser object.

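       The two points above can be illustrated with a minimal sketch. The
       sample documents and variable names are assumptions made for the
       example, not part of the module: a parser created without Style or
       Handlers only checks well-formedness, and a handler receives the
       XML::Parser::Expat object as its first parameter.

```perl
use strict;
use warnings;
use XML::Parser;

# No Style and no Handlers: parse() only checks that the document is well-formed.
my $checker     = XML::Parser->new;
my $well_formed = $checker->parse('<doc><item/></doc>');

# Record the class of the first argument passed to a Start handler.
my $first_arg_class;
my $parser = XML::Parser->new(
    Handlers => { Start => sub { $first_arg_class = ref $_[0] } },
);
$parser->parse('<doc/>');

print "well-formed: ", ($well_formed ? 'yes' : 'no'), "\n";
print "handler received a $first_arg_class object\n";
```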

METHODS

       new This is a class method, the constructor for XML::Parser. Options
           are passed as keyword value pairs. Recognized options are:

           •   Style

               This option provides an easy way to create a given style of
               parser. The built in styles are: "Debug", "Subs", "Tree",
               "Objects", and "Stream". These are all defined in separate
               packages under "XML::Parser::Style::*", and you can find
               further documentation for each style both below, and in those
               packages.

               Custom styles can be provided by giving a full package name
               containing at least one '::'. This package should then have
               subs defined for each handler it wishes to have installed. See
               "STYLES" below for a discussion of each built in style.

           •   Handlers

               When provided, this option should be an anonymous hash
               containing as keys the type of handler and as values a sub
               reference to handle that type of event. All the handlers get
               passed as their 1st parameter the instance of expat that is
               parsing the document. Further details on handlers can be found
               in "HANDLERS". Any handler set here overrides the corresponding
               handler set with the Style option.

           •   Pkg

               Some styles will refer to subs defined in this package. If not
               provided, it defaults to the package which called the
               constructor.

           •   ErrorContext

               This is an Expat option. When this option is defined, errors
               are reported in context. The value should be the number of
               lines to show on either side of the line in which the error
               occurred.

           •   ProtocolEncoding

               This is an Expat option. This sets the protocol encoding name.
               It defaults to none. The built-in encodings are: "UTF-8",
               "ISO-8859-1", "UTF-16", and "US-ASCII". Other encodings may be
               used if they have encoding maps in one of the directories in
               the @Encoding_Path list. Check "ENCODINGS" for more information
               on encoding maps. Setting the protocol encoding overrides any
               encoding in the XML declaration.

           •   Namespaces

               This is an Expat option. If this is set to a true value, then
               namespace processing is done during the parse. See "Namespaces"
               in XML::Parser::Expat for further discussion of namespace
               processing.

           •   NoExpand

               This is an Expat option. Normally, the parser will try to
               expand references to entities defined in the internal subset.
               If this option is set to a true value, and a default handler is
               also set, then the default handler will be called when an
               entity reference is seen in text. This has no effect if a
               default handler has not been registered, and it has no effect
               on the expansion of entity references inside attribute values.

           •   Stream_Delimiter

               This is an Expat option. It takes a string value. When this
               string is found alone on a line while parsing from a stream,
               then the parse is ended as if it saw an end of file. The
               intended use is with a stream of xml documents in a MIME
               multipart format. The string should not contain a trailing
               newline.

           •   ParseParamEnt

               This is an Expat option. Unless standalone is set to "yes" in
               the XML declaration, setting this to a true value allows the
               external DTD to be read, and parameter entities to be parsed
               and expanded.

           •   NoLWP

               This option has no effect if the ExternEnt or ExternEntFin
               handlers are directly set. Otherwise, if true, it forces the
               use of a file based external entity handler.

           •   Non_Expat_Options

               If provided, this should be an anonymous hash whose keys are
               options that shouldn't be passed to Expat. This should only be
               of concern to those subclassing XML::Parser.

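           As a rough sketch of combining constructor options, the example
           below enables namespace processing and error context together
           with a Start handler. The sample document, the "urn:example" URI
           and the variable names are assumptions for illustration; the
           namespace method belongs to XML::Parser::Expat.

```perl
use strict;
use warnings;
use XML::Parser;

my @elements;    # [local name, namespace URI] for every start tag

my $parser = XML::Parser->new(
    Namespaces   => 1,    # Expat option: enable namespace processing
    ErrorContext => 2,    # Expat option: show 2 lines around a parse error
    Handlers     => {
        Start => sub {
            my ($expat, $element) = @_;
            # namespace() returns undef when the name is in no namespace.
            push @elements, [ "$element", $expat->namespace($element) ];
        },
    },
);

$parser->parse('<x:doc xmlns:x="urn:example"><plain/></x:doc>');

for my $pair (@elements) {
    my ($name, $uri) = @$pair;
    printf "%s in %s\n", $name, defined $uri ? $uri : '(no namespace)';
}
```
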
       setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
           This method registers handlers for various parser events. It
           overrides any previous handlers registered through the Style or
           Handlers options or through earlier calls to setHandlers. By
           providing a false or undefined value as the handler, the existing
           handler can be unset.

           This method returns a list of type, handler pairs corresponding to
           the input. The handlers returned are the ones that were in effect
           prior to the call.

           See a description of the handler types in "HANDLERS".

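           Because the previous handlers are returned as type, handler pairs,
           they can be saved and passed back later to restore the earlier
           state. A minimal sketch, with hypothetical handler names:

```perl
use strict;
use warnings;
use XML::Parser;

my (@starts, @loud);
sub record_start { my ($expat, $element) = @_; push @starts, $element }
sub record_loud  { my ($expat, $element) = @_; push @loud, uc $element }

my $parser = XML::Parser->new(Handlers => { Start => \&record_start });

# Swap in a different Start handler; the previous (type, handler) pairs come back.
my @previous = $parser->setHandlers(Start => \&record_loud);
$parser->parse('<a><b/></a>');

# Passing the saved pairs back restores the original handler.
$parser->setHandlers(@previous);
$parser->parse('<c/>');

print "loud: @loud\n";
print "starts: @starts\n";
```
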
       parse(SOURCE [, OPT => OPT_VALUE [...]])
           The SOURCE parameter should either be a string containing the whole
           XML document, or it should be an open IO::Handle. Constructor
           options to XML::Parser::Expat given as keyword-value pairs may
           follow the SOURCE parameter. These override, for this call, any
           options or attributes passed through from the XML::Parser instance.

           parse dies if a parse error occurs. Otherwise it returns 1, or
           whatever is returned from the Final handler if one is installed.
           In other words, what parse may return depends on the style.

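           A short sketch of both outcomes, assuming a hypothetical
           element-counting set of handlers: the value returned by the Final
           handler becomes the return value of parse, and a parse error can
           be trapped with eval.

```perl
use strict;
use warnings;
use XML::Parser;

my $count  = 0;
my $parser = XML::Parser->new(
    Handlers => {
        Init  => sub { $count = 0 },       # reset before each document
        Start => sub { $count++ },
        Final => sub { return $count },    # becomes the return value of parse
    },
);

my $elements = $parser->parse('<a><b/><c/></a>');
print "elements: $elements\n";

# A parse error is reported with die, so wrap the call in eval to recover.
my $result = eval { $parser->parse('<a><b></a>') };
my $error  = $@;
print "parse failed: $error" if $error;
```
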
       parsestring
           This is just an alias for parse for backwards compatibility.

       parsefile(FILE [, OPT => OPT_VALUE [...]])
           Open FILE for reading, then call parse with the open handle. The
           file is closed no matter how parse returns. Returns what parse
           returns.

       parse_start([ OPT => OPT_VALUE [...]])
           Create and return a new instance of XML::Parser::ExpatNB.
           Constructor options may be provided. If an Init handler has been
           provided, it is called before returning the ExpatNB object.
           Documents are parsed by making incremental calls to the parse_more
           method of this object, which takes a string. A single call to the
           parse_done method of this object, which takes no arguments,
           indicates that the document is finished.

           If there is a Final handler installed, it is executed by the
           parse_done method before returning, and the parse_done method
           returns whatever is returned by the Final handler.

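           The incremental interface might be used as in the following
           sketch. The chunk boundaries are arbitrary and chosen for the
           example; in practice the strings would typically arrive from a
           socket or another non-blocking source.

```perl
use strict;
use warnings;
use XML::Parser;

my $text   = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char  => sub { my ($expat, $string) = @_; $text .= $string },
        Final => sub { return length $text },
    },
);

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;

# Feed the document in pieces; a chunk may end in the middle of a tag.
for my $chunk ('<doc><it', 'em>par', 'tial</item></doc>') {
    $nb->parse_more($chunk);
}

# parse_done finishes the document and returns the Final handler's value.
my $length = $nb->parse_done;
print "collected '$text' ($length characters)\n";
```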

HANDLERS

       Expat is an event based parser. As the parser recognizes parts of the
       document (say the start or end tag for an XML element), then any
       handlers registered for that type of an event are called with suitable
       parameters.  All handlers receive an instance of XML::Parser::Expat as
       their first argument. See "METHODS" in XML::Parser::Expat for a
       discussion of the methods that can be called on this object.

   Init (Expat)
       This is called just before the parsing of the document starts.

   Final (Expat)
       This is called just after parsing has finished, but only if no errors
       occurred during the parse. Parse returns what this returns.

   Start (Expat, Element [, Attr, Val [,...]])
       This event is generated when an XML start tag is recognized. Element is
       the name of the XML element type that is opened with the start tag. The
       Attr & Val pairs are generated for each attribute in the start tag.

   End (Expat, Element)
       This event is generated when an XML end tag is recognized. Note that an
       XML empty tag (<foo/>) generates both a start and an end event.

   Char (Expat, String)
       This event is generated when non-markup is recognized. The non-markup
       sequence of characters is in String. A single non-markup sequence of
       characters may generate multiple calls to this handler. Whatever the
       encoding of the string in the original document, this is given to the
       handler in UTF-8.

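       Since one run of text may be delivered in several Char calls, a
       handler that needs the complete text usually accumulates it and acts
       on it in the End handler. A minimal sketch, assuming a hypothetical
       document made of "p" elements:

```perl
use strict;
use warnings;
use XML::Parser;

my @paragraphs;
my $buffer;    # defined only while inside a <p> element

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            $buffer = '' if $element eq 'p';
        },
        Char => sub {
            my ($expat, $string) = @_;
            # Append every piece; do not assume one call per text run.
            $buffer .= $string if defined $buffer;
        },
        End => sub {
            my ($expat, $element) = @_;
            return unless $element eq 'p';
            push @paragraphs, $buffer;
            undef $buffer;
        },
    },
);

$parser->parse('<doc><p>Hello &amp; goodbye</p><p>Second</p></doc>');
print "$_\n" for @paragraphs;
```
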
   Proc (Expat, Target, Data)
       This event is generated when a processing instruction is recognized.

   Comment (Expat, Data)
       This event is generated when a comment is recognized.

   CdataStart (Expat)
       This is called at the start of a CDATA section.

   CdataEnd (Expat)
       This is called at the end of a CDATA section.

   Default (Expat, String)
       This is called for any characters that don't have a registered handler.
       This includes both characters that are part of markup for which no
       events are generated (markup declarations) and characters that could
       generate events, but for which no handler has been registered.

       Whatever the encoding in the original document, the string is returned
       to the handler in UTF-8.

   Unparsed (Expat, Entity, Base, Sysid, Pubid, Notation)
       This is called for a declaration of an unparsed entity. Entity is the
       name of the entity. Base is the base to be used for resolving a
       relative URI.  Sysid is the system id. Pubid is the public id. Notation
       is the notation name. Base and Pubid may be undefined.

   Notation (Expat, Notation, Base, Sysid, Pubid)
       This is called for a notation declaration. Notation is the notation
       name.  Base is the base to be used for resolving a relative URI. Sysid
       is the system id. Pubid is the public id. Base, Sysid, and Pubid may
       all be undefined.

   ExternEnt (Expat, Base, Sysid, Pubid)
       This is called when an external entity is referenced. Base is the base
       to be used for resolving a relative URI. Sysid is the system id. Pubid
       is the public id. Base and Pubid may be undefined.

       This handler should either return a string, which represents the
       contents of the external entity, or return an open filehandle that can
       be read to obtain the contents of the external entity, or return undef,
       which indicates the external entity couldn't be found and will generate
       a parse error.

       If an open filehandle is returned, it must be returned as either a glob
       (*FOO) or as a reference to a glob (e.g. an instance of IO::Handle).

       A default handler is installed for this event. The default handler is
       XML::Parser::lwp_ext_ent_handler, unless the NoLWP option was provided
       with a true value, in which case XML::Parser::file_ext_ent_handler is
       the default handler for external entities. Even without the NoLWP
       option, if the URI or LWP modules are missing, the file based handler
       ends up being used after giving a warning on the first external entity
       reference.

       The LWP external entity handler will use proxies defined in the
       environment (http_proxy, ftp_proxy, etc.).

       Please note that the LWP external entity handler reads the entire
       entity into a string and returns it, whereas the file handler opens a
       filehandle.

       Also note that the file external entity handler will likely choke on
       absolute URIs or file names that don't fit the conventions of the local
       operating system.

       The expat base method can be used to set a basename for relative
       pathnames. If no basename is given, or if the basename is itself a
       relative name, then it is relative to the current working directory.

   ExternEntFin (Expat)
       This is called after parsing an external entity. It's not called unless
       an ExternEnt handler is also set. There is a default handler installed
       that pairs with the default ExternEnt handler.

       If you're going to install your own ExternEnt handler, then you should
       set (or unset) this handler too.

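       A custom ExternEnt handler that returns the entity content as a string
       could look like the following sketch. The %catalog hash, the entity
       name and the file name are hypothetical and exist only for the
       example; nothing is read from disk or from the network.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical in-memory catalog: system id => entity content.
my %catalog = ('chapter1.xml' => '<chapter>First chapter</chapter>');

my (@requested, @started);
my $parser = XML::Parser->new(
    Handlers => {
        Start     => sub { push @started, $_[1] },
        ExternEnt => sub {
            my ($expat, $base, $sysid, $pubid) = @_;
            push @requested, $sysid;
            # A string supplies the entity content directly;
            # undef (unknown system id) makes the parse fail.
            return $catalog{$sysid};
        },
        # Set alongside ExternEnt; a string result needs no cleanup.
        ExternEntFin => sub { },
    },
);

$parser->parse(<<'XML');
<!DOCTYPE book [
  <!ENTITY chapter1 SYSTEM "chapter1.xml">
]>
<book>&chapter1;</book>
XML

print "requested: @requested\n";
print "elements:  @started\n";
```
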
   Entity (Expat, Name, Val, Sysid, Pubid, Ndata, IsParam)
       This is called when an entity is declared. For internal entities, the
       Val parameter will contain the value and the Sysid, Pubid, and Ndata
       parameters will be undefined. For external entities, the Val parameter
       will be undefined, the Sysid parameter will have the system id, the
       Pubid parameter will have the public id if it was provided (it will be
       undefined otherwise), and the Ndata parameter will contain the notation
       for unparsed entities. If this is a parameter entity declaration, then
       the IsParam parameter is true.

       Note that this handler and the Unparsed handler above overlap. If both
       are set, then this handler will not be called for unparsed entities.

   Element (Expat, Name, Model)
       The element handler is called when an element declaration is found.
       Name is the element name, and Model is the content model as an
       XML::Parser::ContentModel object. See "XML::Parser::ContentModel
       Methods" in XML::Parser::Expat for methods available for this class.

   Attlist (Expat, Elname, Attname, Type, Default, Fixed)
       This handler is called for each attribute in an ATTLIST declaration.
       So an ATTLIST declaration that has multiple attributes will generate
       multiple calls to this handler. The Elname parameter is the name of the
       element with which the attribute is being associated. The Attname
       parameter is the name of the attribute. Type is the attribute type,
       given as a string. Default is the default value, which will either be
       "#REQUIRED", "#IMPLIED" or a quoted string (i.e. the returned string
       will begin and end with a quote character).  If Fixed is true, then
       this is a fixed attribute.

   Doctype (Expat, Name, Sysid, Pubid, Internal)
       This handler is called for DOCTYPE declarations. Name is the document
       type name. Sysid is the system id of the document type, if it was
       provided, otherwise it's undefined. Pubid is the public id of the
       document type, which will be undefined if no public id was given.
       Internal is the internal subset, given as a string. If there was no
       internal subset, it will be undefined. Internal will contain all
       whitespace, comments, processing instructions, and declarations seen in
       the internal subset. The declarations will be there whether or not they
       have been processed by another handler (except for unparsed entities
       processed by the Unparsed handler). However, comments and processing
       instructions will not appear if they've been processed by their
       respective handlers.

   DoctypeFin (Expat)
       This handler is called after parsing of the DOCTYPE declaration has
       finished, including any internal or external DTD declarations.

   XMLDecl (Expat, Version, Encoding, Standalone)
       This handler is called for XML declarations. Version is a string
       containing the version. Encoding is either undefined or contains an
       encoding string.  Standalone will be true, false, or undefined if the
       standalone attribute is "yes", "no", or absent, respectively.

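       The declaration handlers can be combined to inspect a document's
       prolog. The following sketch records what the XMLDecl and Entity
       handlers receive; the sample document and the hash names are
       assumptions for illustration. The external entity is declared but
       never referenced, so no ExternEnt handler runs.

```perl
use strict;
use warnings;
use XML::Parser;

my (%decl, %entity);
my $parser = XML::Parser->new(
    Handlers => {
        XMLDecl => sub {
            my ($expat, $version, $encoding, $standalone) = @_;
            %decl = (
                version    => $version,
                encoding   => $encoding,
                standalone => $standalone,
            );
        },
        Entity => sub {
            my ($expat, $name, $val, $sysid, $pubid, $ndata, $is_param) = @_;
            # Internal entities carry a value; external ones carry a system id.
            $entity{$name} = defined $val ? "internal: $val" : "external: $sysid";
        },
    },
);

$parser->parse(<<'XML');
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE doc [
  <!ENTITY greeting "hello">
  <!ENTITY chapter SYSTEM "chapter.xml">
]>
<doc>&greeting;</doc>
XML

print "version $decl{version}, encoding $decl{encoding}\n";
print "$_ => $entity{$_}\n" for sort keys %entity;
```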

STYLES

   Debug
       This just prints out the document in outline form. Nothing special is
       returned by parse.

   Subs
       Each time an element starts, a sub by that name in the package
       specified by the Pkg option is called with the same parameters that the
       Start handler gets called with.

       Each time an element ends, a sub with that name appended with an
       underscore ("_"), is called with the same parameters that the End
       handler gets called with.

       Nothing special is returned by parse.

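       A small sketch of the Subs style follows. The Inventory package and
       the sample document are hypothetical. Elements without a matching sub
       (here "list") are simply skipped.

```perl
use strict;
use warnings;
use XML::Parser;

{
    package Inventory;    # hypothetical package holding the element subs

    our @log;

    # Called for each <item ...> start tag with the Start handler's parameters.
    sub item {
        my ($expat, $element, %attr) = @_;
        push @log, "start $element $attr{id}";
    }

    # Called for each item end tag with the End handler's parameters.
    sub item_ {
        my ($expat, $element) = @_;
        push @log, "end $element";
    }
}

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'Inventory');
$parser->parse('<list><item id="7"/><item id="8"/></list>');

print "$_\n" for @Inventory::log;
```
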
   Tree
       Parse will return a parse tree for the document. Each node in the tree
       takes the form of a tag, content pair. Text nodes are represented with
       a pseudo-tag of "0" and the string that is their content. For elements,
       the content is an array reference. The first item in the array is a
       (possibly empty) hash reference containing attributes. The remainder of
       the array is a sequence of tag-content pairs representing the content
       of the element.

       So for example the result of parsing:

         <foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>

       would be:

                    Tag   Content
         ==================================================================
         [foo, [{}, head, [{id => "a"}, 0, "Hello ",  em, [{}, 0, "there"]],
                     bar, [         {}, 0, "Howdy",  ref, [{}]],
                       0, "do"
               ]
         ]

       The root element "foo" has 3 children: a "head" element, a "bar"
       element and the text "do". After the empty attribute hash, these are
       represented in its contents by 3 tag-content pairs.

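       The tag, content layout lends itself to a recursive walk. The sketch
       below parses the sample document shown above and collects its text;
       the text_of helper is a hypothetical name introduced for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse(
    '<foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>'
);

# Concatenate all text below one (tag, content) pair.
sub text_of {
    my ($tag, $content) = @_;
    return $content if $tag eq '0';          # text node: content is the string
    my ($attributes, @pairs) = @$content;    # element: attribute hash, then pairs
    my $text = '';
    while (my ($child_tag, $child_content) = splice(@pairs, 0, 2)) {
        $text .= text_of($child_tag, $child_content);
    }
    return $text;
}

my ($root_tag, $root_content) = @$tree;
my $head_id  = $root_content->[2][0]{id};    # attribute hash of the first child
my $all_text = text_of($root_tag, $root_content);

print "root: $root_tag\n";
print "head id: $head_id\n";
print "text: $all_text\n";
```
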
   Objects
       This is similar to the Tree style, except that a hash object is created
       for each element. The corresponding object will be in the class whose
       name is created by appending "::" and the element name to the package
       set with the Pkg option. Non-markup text will be in the ::Characters
       class. The contents of the corresponding object will be in an anonymous
       array that is the value of the Kids property for that object.

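       A brief sketch of the resulting structure, using a hypothetical Doc
       package. Two details are taken from the XML::Parser::Style::Objects
       documentation rather than from the description above: parse returns a
       reference to an array of top-level objects, and a Characters object
       keeps its string under the Text key. Attributes are stored as ordinary
       hash keys of the element object.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser  = XML::Parser->new(Style => 'Objects', Pkg => 'Doc');
my $objects = $parser->parse('<foo id="a">Hi<bar/></foo>');

my $root        = $objects->[0];                      # object for <foo>
my @kid_classes = map { ref } @{ $root->{Kids} };     # classes of its children

print "root class: ", ref $root, "\n";
print "id attribute: $root->{id}\n";
print "children: @kid_classes\n";
print "first child text: $root->{Kids}[0]{Text}\n";
```
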
   Stream
       This style also uses the Pkg package. If none of the subs that this
       style looks for is there, then the effect of parsing with this style is
       to print a canonical copy of the document without comments or
       declarations.  All the subs receive as their 1st parameter the Expat
       instance for the document they're parsing.

       It looks for the following routines:

       •   StartDocument

           Called at the start of the parse.

       •   StartTag

           Called for every start tag with a second parameter of the element
           type. The $_ variable will contain a copy of the tag and the %_
           variable will contain attribute values supplied for that element.

       •   EndTag

           Called for every end tag with a second parameter of the element
           type. The $_ variable will contain a copy of the end tag.

       •   Text

           Called just before start or end tags with accumulated non-markup
           text in the $_ variable.

       •   PI

           Called for processing instructions. The $_ variable will contain a
           copy of the PI and the target and data are sent as 2nd and 3rd
           parameters respectively.

       •   EndDocument

           Called at conclusion of the parse.

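       The routines above might be used as in the following sketch, which
       records events instead of printing them. The Outline package and the
       sample document are hypothetical. StartTag, EndTag and Text are all
       defined so that nothing falls back to the style's default printing.

```perl
use strict;
use warnings;
use XML::Parser;

{
    package Outline;    # hypothetical package for the Stream callbacks

    our @events;

    sub StartTag {
        my ($expat, $element) = @_;
        # %_ holds the attributes supplied for this element.
        my @attrs = map { $_ . '=' . $_{$_} } sort keys %_;
        push @events, join ' ', 'start', $element, @attrs;
    }

    sub EndTag {
        my ($expat, $element) = @_;
        push @events, "end $element";
    }

    # $_ holds the accumulated non-markup text.
    sub Text { push @events, "text $_" }

    # The value returned here is what parse returns.
    sub EndDocument { return scalar @events }
}

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'Outline');
my $total  = $parser->parse('<doc><p class="x">Hi</p></doc>');

print "$_\n" for @Outline::events;
print "$total events\n";
```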

ENCODINGS

       XML documents may be encoded in character sets other than Unicode as
       long as they may be mapped into the Unicode character set. Expat has
       further restrictions on encodings. Read the xmlparse.h header file in
       the expat distribution to see details on these restrictions.

       Expat has built-in encodings for: "UTF-8", "ISO-8859-1", "UTF-16", and
       "US-ASCII". Encodings are set either through the XML declaration
       encoding attribute or through the ProtocolEncoding option to
       XML::Parser or XML::Parser::Expat.

       For encodings other than the built-ins, expat calls the function
       load_encoding in the Expat package with the encoding name. This
       function looks for a file in the path list
       @XML::Parser::Expat::Encoding_Path, that matches the lower-cased name
       with a '.enc' extension. The first one it finds, it loads.

       If you wish to build your own encoding maps, check out the
       XML::Encoding module from CPAN.

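       As a small illustration of a built-in encoding, the sketch below
       parses a byte string that has no XML declaration and declares it as
       ISO-8859-1 through ProtocolEncoding. It assumes a reasonably recent
       XML::Parser, which hands text to handlers as decoded Perl character
       strings; the sample word is arbitrary.

```perl
use strict;
use warnings;
use XML::Parser;

my $text   = '';
my $parser = XML::Parser->new(
    Handlers => { Char => sub { my ($expat, $string) = @_; $text .= $string } },
);

# "\xE9" is the single byte for e-acute in ISO-8859-1.
$parser->parse("<word>caf\xE9</word>", ProtocolEncoding => 'ISO-8859-1');

my $last_code_point = ord substr($text, -1);
printf "%d characters, last code point U+%04X\n", length $text, $last_code_point;
```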

AUTHORS

       Larry Wall <larry@wall.org> wrote version 1.0.

       Clark Cooper <coopercc@netheaven.com> picked up support, changed the
       API for this version (2.x), provided documentation, and added some
       standard package features.

       Matt Sergeant <matt@sergeant.org> is now maintaining XML::Parser.

perl v5.38.0                      2023-07-21                         Parser(3)