Expat(3)          User Contributed Perl Documentation          Expat(3)

NAME
    XML::Parser::Expat - Low-level access to James Clark's expat XML parser

SYNOPSIS
    use XML::Parser::Expat;

    my $parser = XML::Parser::Expat->new;
    $parser->setHandlers('Start' => \&sh,
                         'End'   => \&eh,
                         'Char'  => \&ch);
    open(FOO, 'info.xml') or die "Couldn't open info.xml: $!";
    $parser->parse(*FOO);
    close(FOO);
    # $parser->parse('<foo id="me"> here <em>we</em> go </foo>');

    # Start handler: switch character handling inside <special>
    sub sh
    {
        my ($p, $el, %atts) = @_;
        $p->setHandlers('Char' => \&spec)
            if ($el eq 'special');
        ...
    }

    # End handler: restore the normal character handler
    sub eh
    {
        my ($p, $el) = @_;
        $p->setHandlers('Char' => \&ch)  # Special elements won't contain
            if ($el eq 'special');       # other special elements
        ...
    }

DESCRIPTION
    This module provides an interface to James Clark's XML parser, expat.
    As in expat, a single instance of the parser can only parse one
    document. Calls to parsestring after the first for a given instance
    will die.

    Expat (and XML::Parser::Expat) are event-based. As the parser
    recognizes parts of the document (say the start or end of an XML
    element), any handlers registered for that type of event are called
    with suitable parameters.

METHODS
    new
        This is a class method, the constructor for XML::Parser::Expat.
        Options are passed as keyword-value pairs. The recognized
        options are:

        * ProtocolEncoding
            The protocol encoding name. The default is none. The expat
            built-in encodings are: "UTF-8", "ISO-8859-1", "UTF-16", and
            "US-ASCII". Other encodings may be used if they have encoding
            maps in one of the directories in the @Encoding_Path list.
            Setting the protocol encoding overrides any encoding in the
            XML declaration.

        * Namespaces
            When this option is given with a true value, the parser does
            namespace processing. By default, namespace processing is
            turned off. When it is turned on, the parser consumes xmlns
            attributes and strips off prefixes from element and attribute
            names where those prefixes have a defined namespace. A name's
            namespace can be found using the "namespace" method, and two
            names can be checked for absolute equality with the "eq_name"
            method.

        * NoExpand
            Normally, the parser will try to expand references to
            entities defined in the internal subset. If this option is
            set to a true value, and a default handler is also set, then
            the default handler will be called when an entity reference
            is seen in text. This has no effect if a default handler has
            not been registered, and it has no effect on the expansion of
            entity references inside attribute values.

        * Stream_Delimiter
            This option takes a string value. When this string is found
            alone on a line while parsing from a stream, the parse is
            ended as if it saw an end of file. The intended use is with a
            stream of XML documents in a MIME multipart format. The
            string should not contain a trailing newline.

        * ErrorContext
            When this option is defined, errors are reported in context.
            The value of ErrorContext should be the number of lines to
            show on either side of the line in which the error occurred.

        * ParseParamEnt
            Unless standalone is set to "yes" in the XML declaration,
            setting this to a true value allows the external DTD to be
            read, and parameter entities to be parsed and expanded.

        * Base
            The base to use for relative pathnames or URLs. This can also
            be done by using the base method.
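As a rough illustration of passing options to the constructor, the sketch below turns on ErrorContext and feeds the parser a deliberately malformed document. The sample document and the variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Options are plain key/value pairs; ErrorContext => 1 asks for one line
# of context on either side of the line where an error is reported.
my $parser = XML::Parser::Expat->new(ErrorContext => 1);

# Deliberately malformed sample document: <a> is never closed.
my $bad_xml = "<doc>\n  <a>\n</doc>\n";

# parse croaks on a document that is not well formed, so trap it.
my $ok  = eval { $parser->parse($bad_xml); 1 };
my $err = $ok ? '' : $@;
$parser->release;    # break the parser's circular references

print "Parse failed:\n$err\n" unless $ok;
```

Wrapping parse in eval keeps the error text available for logging while still letting the script release the parser.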

    setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
        This method registers handlers for the various events. If no
        handlers are registered, then a call to parsestring or parsefile
        will only determine if the corresponding XML document is well
        formed (by returning without error). This may be called from
        within a handler, after the parse has started.

        Setting a handler to something that evaluates to false unsets
        that handler.

        This method returns a list of type, handler pairs corresponding
        to the input. The handlers returned are the ones that were in
        effect before the call to setHandlers.

        The recognized events and the parameters passed to the
        corresponding handlers are:

        * Start (Parser, Element [, Attr, Val [,...]])
            This event is generated when an XML start tag is recognized.
            Parser is an XML::Parser::Expat instance. Element is the name
            of the XML element that is opened with the start tag. The
            Attr & Val pairs are generated for each attribute in the
            start tag.

        * End (Parser, Element)
            This event is generated when an XML end tag is recognized.
            Note that an XML empty tag (<foo/>) generates both a start
            and an end event.

            There is always a lower level start and end handler installed
            that wrap the corresponding callbacks. This is to handle the
            context mechanism. A consequence of this is that the default
            handler (see below) will not see a start tag or end tag
            unless the default_current method is called.

        * Char (Parser, String)
            This event is generated when non-markup is recognized. The
            non-markup sequence of characters is in String. A single
            non-markup sequence of characters may generate multiple calls
            to this handler. Whatever the encoding of the string in the
            original document, this is given to the handler in UTF-8.

        * Proc (Parser, Target, Data)
            This event is generated when a processing instruction is
            recognized.

        * Comment (Parser, String)
            This event is generated when a comment is recognized.

        * CdataStart (Parser)
            This is called at the start of a CDATA section.

        * CdataEnd (Parser)
            This is called at the end of a CDATA section.

        * Default (Parser, String)
            This is called for any characters that don't have a
            registered handler. This includes both characters that are
            part of markup for which no events are generated (markup
            declarations) and characters that could generate events, but
            for which no handler has been registered.

            Whatever the encoding in the original document, the string is
            returned to the handler in UTF-8.

        * Unparsed (Parser, Entity, Base, Sysid, Pubid, Notation)
            This is called for a declaration of an unparsed entity.
            Entity is the name of the entity. Base is the base to be used
            for resolving a relative URI. Sysid is the system id. Pubid
            is the public id. Notation is the notation name. Base and
            Pubid may be undefined.

        * Notation (Parser, Notation, Base, Sysid, Pubid)
            This is called for a declaration of notation. Notation is the
            notation name. Base is the base to be used for resolving a
            relative URI. Sysid is the system id. Pubid is the public id.
            Base, Sysid, and Pubid may all be undefined.

        * ExternEnt (Parser, Base, Sysid, Pubid)
            This is called when an external entity is referenced. Base is
            the base to be used for resolving a relative URI. Sysid is
            the system id. Pubid is the public id. Base and Pubid may be
            undefined.

            This handler should either return a string, which represents
            the contents of the external entity, or return an open
            filehandle that can be read to obtain the contents of the
            external entity, or return undef, which indicates the
            external entity couldn't be found and will generate a parse
            error.

            If an open filehandle is returned, it must be returned as
            either a glob (*FOO) or as a reference to a glob (e.g. an
            instance of IO::Handle).

        * ExternEntFin (Parser)
            This is called after an external entity has been parsed. It
            allows applications to perform cleanup on actions performed
            in the above ExternEnt handler.

        * Entity (Parser, Name, Val, Sysid, Pubid, Ndata, IsParam)
            This is called when an entity is declared. For internal
            entities, the Val parameter will contain the value and the
            remaining three parameters will be undefined. For external
            entities, the Val parameter will be undefined, the Sysid
            parameter will have the system id, the Pubid parameter will
            have the public id if it was provided (it will be undefined
            otherwise), and the Ndata parameter will contain the notation
            for unparsed entities. If this is a parameter entity
            declaration, then the IsParam parameter is true.

            Note that this handler and the Unparsed handler above
            overlap. If both are set, then this handler will not be
            called for unparsed entities.

        * Element (Parser, Name, Model)
            The element handler is called when an element declaration is
            found. Name is the element name, and Model is the content
            model as an XML::Parser::ContentModel object. See
            "XML::Parser::ContentModel Methods" for methods available for
            this class.

        * Attlist (Parser, Elname, Attname, Type, Default, Fixed)
            This handler is called for each attribute in an ATTLIST
            declaration. So an ATTLIST declaration that has multiple
            attributes will generate multiple calls to this handler. The
            Elname parameter is the name of the element with which the
            attribute is being associated. The Attname parameter is the
            name of the attribute. Type is the attribute type, given as a
            string. Default is the default value, which will either be
            "#REQUIRED", "#IMPLIED" or a quoted string (i.e. the returned
            string will begin and end with a quote character). If Fixed
            is true, then this is a fixed attribute.

        * Doctype (Parser, Name, Sysid, Pubid, Internal)
            This handler is called for DOCTYPE declarations. Name is the
            document type name. Sysid is the system id of the document
            type, if it was provided, otherwise it's undefined. Pubid is
            the public id of the document type, which will be undefined
            if no public id was given. Internal will be true or false,
            indicating whether or not the doctype declaration contains an
            internal subset.

        * DoctypeFin (Parser)
            This handler is called after parsing of the DOCTYPE
            declaration has finished, including any internal or external
            DTD declarations.

        * XMLDecl (Parser, Version, Encoding, Standalone)
            This handler is called for XML declarations. Version is a
            string containing the version. Encoding is either undefined
            or contains an encoding string. Standalone is either
            undefined, or true or false. Undefined indicates that no
            standalone parameter was given in the XML declaration. True
            or false indicates "yes" or "no" respectively.
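To make the handler signatures and the return value of setHandlers more concrete, here is a minimal sketch that registers XMLDecl, Proc and Comment handlers, then swaps one out and restores it using the returned pairs. The sample document, the "render" processing instruction and all variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my (%decl, @pis, @comments);

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    XMLDecl => sub {
        my ($p, $version, $encoding, $standalone) = @_;
        @decl{qw(version encoding)} = ($version, $encoding);
    },
    Proc    => sub { my ($p, $target, $data) = @_; push @pis, $target },
    Comment => sub { my ($p, $text) = @_; push @comments, $text },
);

# setHandlers returns (type, previous handler) pairs, so a handler can be
# swapped out temporarily and restored later. A false value unsets it.
my %previous = $parser->setHandlers(Comment => undef);
$parser->setHandlers(%previous);    # put the Comment handler back

$parser->parse(<<'XML');
<?xml version="1.0" encoding="UTF-8"?>
<?render mode="draft"?>
<!-- sample comment -->
<doc/>
XML
$parser->release;

print "version=$decl{version} PIs=@pis comments=", scalar(@comments), "\n";
```

Keeping the returned pairs in a hash is just one convenient way to restore earlier handlers; a plain list works equally well.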

    namespace(name)
        Return the URI of the namespace that the name belongs to. If the
        name doesn't belong to any namespace, an undef is returned. This
        is only valid on names received through the Start or End handlers
        from a single document, or through a call to the generate_ns_name
        method. In other words, don't use names generated from one
        instance of XML::Parser::Expat with other instances.

    eq_name(name1, name2)
        Return true if name1 and name2 are identical (i.e. same name and
        from the same namespace). This is only meaningful if both names
        were obtained through the Start or End handlers from a single
        document, or through a call to the generate_ns_name method.

    generate_ns_name(name, namespace)
        Return a name, associated with a given namespace, good for using
        with the above 2 methods. The namespace argument should be the
        namespace URI, not a prefix.

    new_ns_prefixes
        When called from a start tag handler, returns namespace prefixes
        declared with this start tag. If called elsewhere (or if there
        were no namespace prefixes declared), it returns an empty list.
        Setting of the default namespace is indicated with '#default' as
        a prefix.

    expand_ns_prefix(prefix)
        Return the URI to which the given prefix is currently bound.
        Returns undef if the prefix isn't currently bound. Use '#default'
        to find the current binding of the default namespace (if any).

    current_ns_prefixes
        Return a list of currently bound namespace prefixes. The order of
        the prefixes in the list has no meaning. If the default namespace
        is currently bound, '#default' appears in the list.
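The namespace methods are easier to follow with a small run-through. The following sketch assumes a parser created with Namespaces turned on; the two URIs, the element names and the variable names are all invented for illustration.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my (%uri_for, @declared, @matches);

my $parser = XML::Parser::Expat->new(Namespaces => 1);

# A name tied to a namespace URI (not a prefix), comparable via eq_name.
my $wanted = $parser->generate_ns_name('order', 'urn:example:inventory');

$parser->setHandlers(
    Start => sub {
        my ($p, $el) = @_;
        # $el arrives with its prefix stripped; namespace() gives the URI.
        $uri_for{$el} = $p->namespace($el);
        push @declared, $p->new_ns_prefixes;
        push @matches, "$el" if $p->eq_name($el, $wanted);
    },
);
$parser->parse(
    '<inv:order xmlns:inv="urn:example:inventory" xmlns="urn:example:default">'
  . '<item/></inv:order>'
);
$parser->release;

print "$_ => $uri_for{$_}\n" for sort keys %uri_for;
print "Prefixes declared: @declared\n";
```

Comparing with eq_name rather than a plain string test avoids treating same-named elements from different namespaces as equal.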

    recognized_string
        Returns the string from the document that was recognized in order
        to call the current handler. For instance, when called from a
        start handler, it will give us the start-tag string. The string
        is encoded in UTF-8. This method doesn't return a meaningful
        string inside declaration handlers.

    original_string
        Returns the verbatim string from the document that was recognized
        in order to call the current handler. The string is in the
        original document encoding. This method doesn't return a
        meaningful string inside declaration handlers.

    default_current
        When called from a handler, causes the sequence of characters
        that generated the corresponding event to be sent to the default
        handler (if one is registered). Use of this method is deprecated
        in favor of the recognized_string method, which you can use
        without installing a default handler. This method doesn't deliver
        a meaningful string to the default handler when called from
        inside declaration handlers.
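A short sketch of recognized_string in use: the start handler below records the exact markup that triggered each call, without installing a default handler. The sample document and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my @tags;

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p) = @_;
        # The markup that triggered this call, delivered as UTF-8.
        push @tags, $p->recognized_string;
    },
);
$parser->parse('<note lang="en">plain <em>marked</em> text</note>');
$parser->release;

print "$_\n" for @tags;
```

Replacing recognized_string with original_string would give the same markup in the document's original encoding instead.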

    xpcroak(message)
        Concatenate onto the given message the current line number within
        the XML document plus the message implied by ErrorContext. Then
        croak with the formed message.

    xpcarp(message)
        Concatenate onto the given message the current line number within
        the XML document plus the message implied by ErrorContext. Then
        carp with the formed message.

    current_line
        Returns the line number of the current position of the parse.

    current_column
        Returns the column number of the current position of the parse.

    current_byte
        Returns the current position of the parse as a byte offset.
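One plausible use of these methods is application-level validation inside a handler. The sketch below records where each element starts and uses xpcroak to reject an element name; the "forbidden" element and the sample document are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my @seen;

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p, $el) = @_;
        push @seen, sprintf('%s@%d:%d', $el, $p->current_line, $p->current_column);
        # Reject an element this (made-up) application does not allow;
        # xpcroak appends the current line number to the message.
        $p->xpcroak("Unexpected <$el> element") if $el eq 'forbidden';
    },
);

my $ok  = eval { $parser->parse("<doc>\n  <ok/>\n  <forbidden/>\n</doc>\n"); 1 };
my $err = $ok ? '' : $@;
$parser->release;

print "Visited: @seen\n";
print "Error:   $err\n" unless $ok;
```

Using xpcarp in place of xpcroak would report the same location as a warning and let the parse continue.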

    base([NEWBASE])
        Returns the current value of the base for resolving relative
        URIs. If NEWBASE is supplied, changes the base to that value.

    context
        Returns a list of element names that represent open elements,
        with the last one being the innermost. Inside start and end tag
        handlers, the last one will be the tag of the parent element.

    current_element
        Returns the name of the innermost currently opened element.
        Inside start or end handlers, returns the parent of the element
        associated with those tags.

    in_element(NAME)
        Returns true if NAME is equal to the name of the innermost
        currently opened element. If namespace processing is being used
        and you want to check against a name that may be in a namespace,
        then use the generate_ns_name method to create the NAME argument.

    within_element(NAME)
        Returns the number of times the given name appears in the context
        list. If namespace processing is being used and you want to check
        against a name that may be in a namespace, then use the
        generate_ns_name method to create the NAME argument.

    depth
        Returns the size of the context list.

    element_index
        Returns an integer that is the depth-first visit order of the
        current element. This will be zero outside of the root element.
        For example, this will return 1 when called from the start
        handler for the root element start tag.

    skip_until(INDEX)
        INDEX is an integer that represents an element index. When this
        method is called, all handlers are suspended until the start tag
        for an element that has an index number equal to INDEX is seen.
        If a start handler has been set, then this is the first tag that
        the start handler will see after skip_until has been called.
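The context-related methods can be seen together in a small sketch. It records the element index and depth from the start handler, and the full element path from the character handler. The sample document and variable names are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my (@indexes, @paths);

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p, $el) = @_;
        # Inside a start handler the new element is not yet on the
        # context list, so depth() still describes the parent.
        push @indexes, "$el=" . $p->element_index . " (depth " . $p->depth . ")";
    },
    Char => sub {
        my ($p, $text) = @_;
        return unless $text =~ /\S/;    # ignore whitespace-only runs
        push @paths, join('/', $p->context) . ": $text";
    },
);
$parser->parse('<book><chapter><title>Intro</title></chapter></book>');
$parser->release;

print "$_\n" for @indexes, @paths;
```

An index saved this way is the kind of value skip_until expects as its INDEX argument.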

    position_in_context(LINES)
        Returns a string that shows the current parse position. LINES
        should be an integer >= 0 that represents the number of lines on
        either side of the current parse line to place into the returned
        string.

    xml_escape(TEXT [, CHAR [, CHAR ...]])
        Returns TEXT with markup characters turned into character
        entities. Any additional characters provided as arguments are
        also turned into character references where found in TEXT.
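A brief sketch of xml_escape, first with no extra characters and then with '>' added to the set to escape. The input strings are arbitrary examples.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my $parser = XML::Parser::Expat->new;

# With no extra arguments, the markup characters & and < are escaped.
my $basic = $parser->xml_escape('AT&T < 5');

# Extra single characters are escaped as well; here '>' is added.
my $extra = $parser->xml_escape('a > b', '>');

$parser->release;
print "$basic\n$extra\n";
```

The method is called on a parser instance even though no document is being parsed at the time.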

    parse(SOURCE)
        The SOURCE parameter should either be a string containing the
        whole XML document, or it should be an open IO::Handle. Only a
        single document may be parsed for a given instance of
        XML::Parser::Expat, so this will croak if it's been called
        previously for this instance.

    parsestring(XML_DOC_STRING)
        Parses the given string as an XML document. Only a single
        document may be parsed for a given instance of
        XML::Parser::Expat, so this will die if either parsestring or
        parsefile has been called for this instance previously.

        This method is deprecated in favor of the parse method.

    parsefile(FILENAME)
        Parses the XML document in the given file. Will die if
        parsestring or parsefile has been called previously for this
        instance.
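The one-document-per-instance rule is easy to trip over, so here is a minimal sketch of what happens on a second parse and of the usual workaround, a fresh instance per document. The documents and variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my $count = 0;

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(Start => sub { $count++ });
$parser->parse('<a><b/><c/></a>');    # first document: fine

# A second parse on the same instance croaks, so trap it with eval.
my $second_ok = eval { $parser->parse('<x/>'); 1 };
$parser->release;

# One fresh instance per document is the pattern that works.
my $other = XML::Parser::Expat->new;
$other->setHandlers(Start => sub { $count++ });
$other->parse('<x/>');
$other->release;

print "Start tags seen: $count\n";
print "Second parse on the same instance failed, as documented\n"
    unless $second_ok;
```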

    is_defaulted(ATTNAME)
        NO LONGER WORKS. To find out if an attribute is defaulted please
        use the specified_attr method.

    specified_attr
        When the start handler receives lists of attributes and values,
        the non-defaulted (i.e. explicitly specified) attributes occur in
        the list first. This method returns the number of specified items
        in the list. So if this number is equal to the length of the
        list, there were no defaulted values. Otherwise the number points
        to the index of the first defaulted attribute name.
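The following sketch shows one way to split a start handler's attribute list into explicit and defaulted attributes with specified_attr. It assumes a document whose internal subset declares a default for a "kind" attribute; the DTD, element names and variable names are made up.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my (@explicit, @defaulted);

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p, $el, @attrs) = @_;     # flat (name, value, ...) list
        my $n = $p->specified_attr;    # count of explicitly given items
        for (my $i = 0; $i < @attrs; $i += 2) {
            my $bucket = $i < $n ? \@explicit : \@defaulted;
            push @$bucket, "$attrs[$i]=$attrs[$i + 1]";
        }
    },
);

# The internal subset supplies a default for "kind"; only "id" is written.
$parser->parse(<<'XML');
<!DOCTYPE list [
  <!ATTLIST item kind CDATA "plain">
]>
<list><item id="7"/></list>
XML
$parser->release;

print "explicit: @explicit\ndefaulted: @defaulted\n";
```

Note that the returned number counts list items, names and values alike, so it can be compared directly with an index into the flat attribute list.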

    finish
        Unsets all handlers (including internal ones that set context),
        but expat continues parsing to the end of the document or until
        it finds an error. It should finish up a lot faster than with the
        handlers set.

    release
        There are data structures used by XML::Parser::Expat that have
        circular references. This means that these structures will never
        be garbage collected unless these references are explicitly
        broken. Calling this method breaks those references (and makes
        the instance unusable).

        Normally, higher level calls handle this for you, but if you are
        using XML::Parser::Expat directly, then it's your responsibility
        to call it.
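As an illustration of finish and release working together, the sketch below stops handling events once a particular element has been seen and then releases the parser. The "title" element and the sample document are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my @seen;

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub {
        my ($p, $el) = @_;
        push @seen, $el;
        # Found what we were looking for: drop every handler and let
        # expat run through the remainder of the document on its own.
        $p->finish if $el eq 'title';
    },
);
$parser->parse('<doc><title>T</title><para/><para/></doc>');

# XML::Parser::Expat is being used directly, so releasing is up to us.
$parser->release;

print "Handlers saw: @seen\n";
```

The rest of the document is still checked for well-formedness after finish; only the callbacks stop.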

  XML::Parser::ContentModel Methods

    The element declaration handlers are passed objects of this class as
    the content model of the element declaration. They also represent
    content particles, components of a content model.

    When referred to as a string, these objects are automagically
    converted to a string representation of the model (or content
    particle).

    isempty
        This method returns true if the object is "EMPTY", false
        otherwise.

    isany
        This method returns true if the object is "ANY", false otherwise.

    ismixed
        This method returns true if the object is "(#PCDATA)" or
        "(#PCDATA|...)*", false otherwise.

    isname
        This method returns true if the object is an element name.

    ischoice
        This method returns true if the object is a choice of content
        particles.

    isseq
        This method returns true if the object is a sequence of content
        particles.

    quant
        This method returns undef or a string representing the quantifier
        ('?', '*', '+') associated with the model or particle.

    children
        This method returns undef or (for mixed, choice, and sequence
        types) an array of component content particles. There will always
        be at least one component for choices and sequences, but for a
        mixed content model of pure PCDATA, "(#PCDATA)", an undef is
        returned.
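To show how these methods fit together, here is a minimal sketch of an Element handler that classifies each declared content model and lists its component particles. The DTD, element names and the %summary structure are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my %summary;

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Element => sub {
        my ($p, $name, $model) = @_;
        my @parts = $model->children;    # may be a single undef
        $summary{$name} = {
            text  => "$model",           # stringified content model
            kind  => $model->isseq    ? 'sequence'
                   : $model->ischoice ? 'choice'
                   : $model->ismixed  ? 'mixed'
                   : $model->isempty  ? 'EMPTY'
                   :                    'other',
            parts => [ map { "$_" } grep { defined } @parts ],
        };
    },
);
$parser->parse(<<'XML');
<!DOCTYPE doc [
  <!ELEMENT doc (head, (para|list)*)>
  <!ELEMENT head (#PCDATA)>
  <!ELEMENT br EMPTY>
]>
<doc><head>h</head></doc>
XML
$parser->release;

print "$_: $summary{$_}{kind} $summary{$_}{text}\n" for sort keys %summary;
```

The grep for defined values covers models such as "(#PCDATA)" and "EMPTY", for which children returns undef rather than a list of particles.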

  XML::Parser::ExpatNB Methods

    The class XML::Parser::ExpatNB is a subclass of XML::Parser::Expat
    used for non-blocking access to the expat library. It does not
    support the parse, parsestring, or parsefile methods, but it does
    have these additional methods:

    parse_more(DATA)
        Feed expat more text to munch on.

    parse_done
        Tell expat that it's gotten the whole document.
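A rough sketch of incremental parsing with XML::Parser::ExpatNB follows. The chunk boundaries are arbitrary and chosen to fall in the middle of tags; in real code the chunks would presumably arrive from a socket or pipe. Constructing the object directly with new, rather than through a higher-level module, is an assumption of this example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;    # also defines XML::Parser::ExpatNB

my $text = '';

my $nb = XML::Parser::ExpatNB->new;
$nb->setHandlers(Char => sub { my ($p, $s) = @_; $text .= $s });

# Feed the document piece by piece; chunks may split it anywhere.
for my $chunk ('<msg><bo', 'dy>hel', 'lo</body></m', 'sg>') {
    $nb->parse_more($chunk);
}
$nb->parse_done;    # signal that the whole document has been supplied

print "Collected: $text\n";
```

Because the Char handler may be called several times for one run of text, the example appends to a buffer instead of assuming a single call.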

FUNCTIONS
    XML::Parser::Expat::load_encoding(ENCODING)
        Load an external encoding. ENCODING is either the name of an
        encoding or the name of a file. The basename is converted to
        lowercase and a '.enc' extension is appended unless there's one
        already there. Then, unless it's an absolute pathname (i.e.
        begins with '/'), the first file by that name discovered in the
        @Encoding_Path path list is used.

        The encoding in the file is loaded and kept in the
        %Encoding_Table table. Earlier encodings of the same name are
        replaced.

        This function is automatically called by expat when it encounters
        an encoding it doesn't know about. Expat shouldn't call this
        twice for the same encoding name. The only reason users should
        use this function is to explicitly load an encoding not contained
        in the @Encoding_Path list.

AUTHORS
    Larry Wall <larry@wall.org> wrote version 1.0.

    Clark Cooper <coopercc@netheaven.com> picked up support, changed the
    API for this version (2.x), provided documentation, and added some
    standard package features.

perl v5.8.8                      2003-08-18                      Expat(3)