Expat(3)          User Contributed Perl Documentation          Expat(3)

NAME
XML::Parser::Expat - Low-level access to James Clark's expat XML parser
7
SYNOPSIS
    use XML::Parser::Expat;

    $parser = XML::Parser::Expat->new;
    $parser->setHandlers('Start' => \&sh,
                         'End'   => \&eh,
                         'Char'  => \&ch);
    open(FOO, '<', 'info.xml') or die "Couldn't open info.xml: $!";
    $parser->parse(*FOO);
    close(FOO);
    # $parser->parse('<foo id="me"> here <em>we</em> go </foo>');

    sub sh
    {
      my ($p, $el, %atts) = @_;
      $p->setHandlers('Char' => \&spec)
        if ($el eq 'special');
      ...
    }

    sub eh
    {
      my ($p, $el) = @_;
      $p->setHandlers('Char' => \&ch)  # Special elements won't contain
        if ($el eq 'special');         # other special elements
      ...
    }
35
DESCRIPTION
This module provides an interface to James Clark's XML parser, expat.
As in expat, a single instance of the parser can only parse one
document. Calls to parse, parsestring, or parsefile after the first
for a given instance will die.

Expat and XML::Parser::Expat are event based. As the parser
recognizes parts of the document (say the start or end of an XML
element), any handlers registered for that type of event are
called with suitable parameters.
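
To make the event model concrete, here is a minimal sketch that counts start tags and gathers character data. The document string and the names %count and $text are invented for illustration; only the XML::Parser::Expat calls come from this page.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my %count;        # start tags seen, keyed by element name
my $text = '';    # all character data, concatenated

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Start => sub { my ($p, $el, %atts) = @_; $count{$el}++ },
    Char  => sub { my ($p, $str) = @_; $text .= $str },
);

# Made-up sample document, parsed from a string.
$parser->parse('<foo id="me"> here <em>we</em> go </foo>');
$parser->release;    # break circular references when done

print "$_: $count{$_}\n" for sort keys %count;
print "text: [$text]\n";
```

Each handler receives the parser instance as its first argument, so the same subroutine can serve several parser objects.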
46
METHODS
new This is a class method, the constructor for XML::Parser::Expat.
Options are passed as keyword-value pairs. The recognized options
are:
51
52 · ProtocolEncoding
53
54 The protocol encoding name. The default is none. The expat
55 built-in encodings are: "UTF-8", "ISO-8859-1", "UTF-16", and
56 "US-ASCII". Other encodings may be used if they have encoding
57 maps in one of the directories in the @Encoding_Path list.
58 Setting the protocol encoding overrides any encoding in the XML
59 declaration.
60
61 · Namespaces
62
63 When this option is given with a true value, then the parser
64 does namespace processing. By default, namespace processing is
turned off. When it is turned on, the parser consumes xmlns
attributes and strips off prefixes from element and attribute
names where those prefixes have a defined namespace. A name's
68 namespace can be found using the "namespace" method and two
69 names can be checked for absolute equality with the "eq_name"
70 method.
71
72 · NoExpand
73
74 Normally, the parser will try to expand references to entities
75 defined in the internal subset. If this option is set to a true
76 value, and a default handler is also set, then the default
77 handler will be called when an entity reference is seen in
78 text. This has no effect if a default handler has not been
79 registered, and it has no effect on the expansion of entity
80 references inside attribute values.
81
82 · Stream_Delimiter
83
84 This option takes a string value. When this string is found
85 alone on a line while parsing from a stream, then the parse is
86 ended as if it saw an end of file. The intended use is with a
stream of XML documents in a MIME multipart format. The string
88 should not contain a trailing newline.
89
90 · ErrorContext
91
92 When this option is defined, errors are reported in context.
93 The value of ErrorContext should be the number of lines to show
94 on either side of the line in which the error occurred.
95
96 · ParseParamEnt
97
98 Unless standalone is set to "yes" in the XML declaration,
99 setting this to a true value allows the external DTD to be
100 read, and parameter entities to be parsed and expanded.
101
102 · Base
103
104 The base to use for relative pathnames or URLs. This can also
105 be done by using the base method.
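
As a rough illustration of passing options to the constructor, the sketch below sets ErrorContext and then parses a deliberately malformed, made-up document inside eval so the error text can be inspected. The variable names are hypothetical, and the exact wording of the error message comes from expat, so treat any particular text as an assumption.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Made-up, deliberately malformed document: <b> is never closed.
my $xml = "<a>\n<b>\n</a>\n";

# Options are passed to new() as key/value pairs.
my $parser = XML::Parser::Expat->new(
    ErrorContext => 1,    # show one line either side of the error
);

# parse() dies on a well-formedness error, so trap it with eval.
my $ok  = eval { $parser->parse($xml); 1 };
my $err = $ok ? '' : $@;
$parser->release;

print "Parse failed:\n$err\n" unless $ok;
```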
106
107 setHandlers(TYPE, HANDLER [, TYPE, HANDLER [...]])
This method registers handlers for the various events. If no
handlers are registered, then a call to parse, parsestring, or
parsefile will only determine whether the corresponding XML document
is well formed (by returning without error). This method may be
called from within a handler, after the parse has started.
113
114 Setting a handler to something that evaluates to false unsets that
115 handler.
116
117 This method returns a list of type, handler pairs corresponding to
118 the input. The handlers returned are the ones that were in effect
119 before the call to setHandlers.
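
Because setHandlers returns the previous type/handler pairs, a handler can be swapped temporarily and restored later without tracking it by hand. The following is a sketch of that pattern; the element name "special", the sample document, and the %text hash are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my %text = (normal => '', special => '');
my $normal  = sub { my ($p, $str) = @_; $text{normal}  .= $str };
my $special = sub { my ($p, $str) = @_; $text{special} .= $str };

my @saved;    # type/handler pairs returned by setHandlers
my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    Char  => $normal,
    Start => sub {
        my ($p, $el) = @_;
        # Swap in the special Char handler and remember the old one.
        @saved = $p->setHandlers(Char => $special) if $el eq 'special';
    },
    End => sub {
        my ($p, $el) = @_;
        # Put back whatever was in effect before the swap.
        $p->setHandlers(@saved) if $el eq 'special';
    },
);
$parser->parse('<doc>a<special>b</special>c</doc>');
$parser->release;

print "normal=$text{normal} special=$text{special}\n";
```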
120
121 The recognized events and the parameters passed to the
122 corresponding handlers are:
123
124 · Start (Parser, Element [, Attr, Val [,...]])
125
126 This event is generated when an XML start tag is recognized.
127 Parser is an XML::Parser::Expat instance. Element is the name
128 of the XML element that is opened with the start tag. The Attr
129 & Val pairs are generated for each attribute in the start tag.
130
131 · End (Parser, Element)
132
133 This event is generated when an XML end tag is recognized. Note
134 that an XML empty tag (<foo/>) generates both a start and an
135 end event.
136
137 There is always a lower level start and end handler installed
138 that wrap the corresponding callbacks. This is to handle the
139 context mechanism. A consequence of this is that the default
140 handler (see below) will not see a start tag or end tag unless
141 the default_current method is called.
142
143 · Char (Parser, String)
144
145 This event is generated when non-markup is recognized. The non-
146 markup sequence of characters is in String. A single non-markup
147 sequence of characters may generate multiple calls to this
148 handler. Whatever the encoding of the string in the original
149 document, this is given to the handler in UTF-8.
150
151 · Proc (Parser, Target, Data)
152
153 This event is generated when a processing instruction is
154 recognized.
155
156 · Comment (Parser, String)
157
158 This event is generated when a comment is recognized.
159
160 · CdataStart (Parser)
161
162 This is called at the start of a CDATA section.
163
164 · CdataEnd (Parser)
165
166 This is called at the end of a CDATA section.
167
168 · Default (Parser, String)
169
170 This is called for any characters that don't have a registered
171 handler. This includes both characters that are part of markup
172 for which no events are generated (markup declarations) and
173 characters that could generate events, but for which no handler
174 has been registered.
175
176 Whatever the encoding in the original document, the string is
177 returned to the handler in UTF-8.
178
179 · Unparsed (Parser, Entity, Base, Sysid, Pubid,
180 Notation)
181
182 This is called for a declaration of an unparsed entity. Entity
183 is the name of the entity. Base is the base to be used for
184 resolving a relative URI. Sysid is the system id. Pubid is the
185 public id. Notation is the notation name. Base and Pubid may be
186 undefined.
187
188 · Notation (Parser, Notation, Base, Sysid, Pubid)
189
190 This is called for a declaration of notation. Notation is the
191 notation name. Base is the base to be used for resolving a
192 relative URI. Sysid is the system id. Pubid is the public id.
193 Base, Sysid, and Pubid may all be undefined.
194
195 · ExternEnt (Parser, Base, Sysid, Pubid)
196
197 This is called when an external entity is referenced. Base is
198 the base to be used for resolving a relative URI. Sysid is the
system id. Pubid is the public id. Base and Pubid may be
undefined.
201
202 This handler should either return a string, which represents
203 the contents of the external entity, or return an open
204 filehandle that can be read to obtain the contents of the
205 external entity, or return undef, which indicates the external
206 entity couldn't be found and will generate a parse error.
207
208 If an open filehandle is returned, it must be returned as
209 either a glob (*FOO) or as a reference to a glob (e.g. an
210 instance of IO::Handle).
211
212 · ExternEntFin (Parser)
213
214 This is called after an external entity has been parsed. It
215 allows applications to perform cleanup on actions performed in
216 the above ExternEnt handler.
217
218 · Entity (Parser, Name, Val, Sysid, Pubid, Ndata,
219 IsParam)
220
This is called when an entity is declared. For internal
entities, the Val parameter will contain the value and the
Sysid, Pubid, and Ndata parameters will be undefined. For external
entities, the Val parameter will be undefined, the Sysid
parameter will have the system id, the Pubid parameter will
have the public id if it was provided (it will be undefined
otherwise), and the Ndata parameter will contain the notation for
unparsed entities. If this is a parameter entity declaration,
then the IsParam parameter is true.
230
231 Note that this handler and the Unparsed handler above overlap.
232 If both are set, then this handler will not be called for
233 unparsed entities.
234
235 · Element (Parser, Name, Model)
236
237 The element handler is called when an element declaration is
238 found. Name is the element name, and Model is the content model
239 as an XML::Parser::ContentModel object. See
240 "XML::Parser::ContentModel Methods" for methods available for
241 this class.
242
243 · Attlist (Parser, Elname, Attname, Type, Default,
244 Fixed)
245
246 This handler is called for each attribute in an ATTLIST
247 declaration. So an ATTLIST declaration that has multiple
248 attributes will generate multiple calls to this handler. The
249 Elname parameter is the name of the element with which the
250 attribute is being associated. The Attname parameter is the
251 name of the attribute. Type is the attribute type, given as a
252 string. Default is the default value, which will either be
253 "#REQUIRED", "#IMPLIED" or a quoted string (i.e. the returned
254 string will begin and end with a quote character). If Fixed is
255 true, then this is a fixed attribute.
256
257 · Doctype (Parser, Name, Sysid, Pubid, Internal)
258
259 This handler is called for DOCTYPE declarations. Name is the
260 document type name. Sysid is the system id of the document
261 type, if it was provided, otherwise it's undefined. Pubid is
262 the public id of the document type, which will be undefined if
263 no public id was given. Internal will be true or false,
264 indicating whether or not the doctype declaration contains an
265 internal subset.
266
267 · DoctypeFin (Parser)
268
269 This handler is called after parsing of the DOCTYPE declaration
270 has finished, including any internal or external DTD
271 declarations.
272
273 · XMLDecl (Parser, Version, Encoding, Standalone)
274
275 This handler is called for XML declarations. Version is a
276 string containing the version. Encoding is either undefined or
277 contains an encoding string. Standalone is either undefined,
278 or true or false. Undefined indicates that no standalone
279 parameter was given in the XML declaration. True or false
280 indicates "yes" or "no" respectively.
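
Since a single run of character data may arrive in several Char calls, a common approach is to accumulate text in a buffer and use it only when the End event arrives. This sketch assumes a made-up document with "title" elements; the buffer and list names are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my @titles;      # completed title strings
my $buf = '';    # text gathered since the last <title> start tag

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(
    # Reset the buffer whenever a title begins.
    Start => sub { my ($p, $el) = @_; $buf = '' if $el eq 'title' },
    # Char may fire more than once per text run, so always append.
    Char  => sub { my ($p, $str) = @_; $buf .= $str },
    # The text is only known to be complete at the end tag.
    End   => sub { my ($p, $el) = @_; push @titles, $buf if $el eq 'title' },
);
$parser->parse(
    "<list><title>Tom &amp; Jerry</title><title>one\ntwo</title></list>"
);
$parser->release;

print "[$_]\n" for @titles;
```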
281
282 namespace(name)
283 Return the URI of the namespace that the name belongs to. If the
284 name doesn't belong to any namespace, an undef is returned. This is
285 only valid on names received through the Start or End handlers from
286 a single document, or through a call to the generate_ns_name
287 method. In other words, don't use names generated from one instance
288 of XML::Parser::Expat with other instances.
289
290 eq_name(name1, name2)
Return true if name1 and name2 are identical (i.e. same name and
from the same namespace). This is only meaningful if both names
293 were obtained through the Start or End handlers from a single
294 document, or through a call to the generate_ns_name method.
295
296 generate_ns_name(name, namespace)
Return a name, associated with a given namespace, suitable for use
with the namespace and eq_name methods. The namespace argument
should be the namespace URI, not a prefix.
300
301 new_ns_prefixes
302 When called from a start tag handler, returns namespace prefixes
303 declared with this start tag. If called elsewhere (or if there were
304 no namespace prefixes declared), it returns an empty list. Setting
305 of the default namespace is indicated with '#default' as a prefix.
306
307 expand_ns_prefix(prefix)
308 Return the uri to which the given prefix is currently bound.
309 Returns undef if the prefix isn't currently bound. Use '#default'
310 to find the current binding of the default namespace (if any).
311
312 current_ns_prefixes
Return a list of currently bound namespace prefixes. The order of
the prefixes in the list has no meaning. If the default
namespace is currently bound, '#default' appears in the list.
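
The namespace methods can be exercised together as in the sketch below. It assumes the Namespaces option is on; the URIs (urn:example:...) and the document are invented for the example.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my (%uri_of, @declared, $x_bound);
my $matches = 0;

my $parser = XML::Parser::Expat->new(Namespaces => 1);
$parser->setHandlers(Start => sub {
    my ($p, $el, %atts) = @_;

    # $el carries only the local name; ask the parser for its URI.
    $uri_of{$el} = $p->namespace($el);

    # Prefixes declared on this very start tag, if any.
    my @new = $p->new_ns_prefixes;
    push @declared, sort @new;

    # Current binding of the (hypothetical) prefix "x".
    $x_bound = $p->expand_ns_prefix('x');

    # Compare against a name built from a URI, not from a prefix.
    my $wanted = $p->generate_ns_name('item', 'urn:example:x');
    $matches++ if $p->eq_name($el, $wanted);
});
$parser->parse(
    '<doc xmlns="urn:example:default" xmlns:x="urn:example:x">'
  . '<x:item/><plain/></doc>'
);
$parser->release;

print "$_ => $uri_of{$_}\n" for sort keys %uri_of;
print "declared: @declared\n";
```

Names are compared by URI rather than by prefix, so the check keeps working if a document author picks a different prefix for the same namespace.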
316
317 recognized_string
318 Returns the string from the document that was recognized in order
319 to call the current handler. For instance, when called from a start
320 handler, it will give us the start-tag string. The string is
321 encoded in UTF-8. This method doesn't return a meaningful string
322 inside declaration handlers.
323
324 original_string
325 Returns the verbatim string from the document that was recognized
326 in order to call the current handler. The string is in the original
327 document encoding. This method doesn't return a meaningful string
328 inside declaration handlers.
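
For example, a start handler can capture the literal start tag it was called for. This is a sketch with a made-up ASCII document, for which the two methods are expected to return the same bytes; the array names are hypothetical.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my (@recognized, @original);

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(Start => sub {
    my ($p) = @_;
    push @recognized, $p->recognized_string;    # UTF-8 form
    push @original,   $p->original_string;      # bytes as in the input
});
$parser->parse(q{<doc a="1"><em class='x'>hi</em></doc>});
$parser->release;

print "$_\n" for @recognized;
```

The quoting style of each attribute is preserved, which makes these methods handy for pass-through filters that should not normalize markup.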
329
330 default_current
331 When called from a handler, causes the sequence of characters that
332 generated the corresponding event to be sent to the default handler
(if one is registered). Use of this method is deprecated in favor
of the recognized_string method, which you can use without installing
335 a default handler. This method doesn't deliver a meaningful string
336 to the default handler when called from inside declaration
337 handlers.
338
339 xpcroak(message)
340 Concatenate onto the given message the current line number within
341 the XML document plus the message implied by ErrorContext. Then
342 croak with the formed message.
343
344 xpcarp(message)
345 Concatenate onto the given message the current line number within
346 the XML document plus the message implied by ErrorContext. Then
347 carp with the formed message.
348
349 current_line
350 Returns the line number of the current position of the parse.
351
352 current_column
353 Returns the column number of the current position of the parse.
354
355 current_byte
Returns the current byte position of the parse.
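
The position and error-reporting methods might be combined as follows to reject an element the application does not accept. The element name "bad", the message text, and the sample document are assumptions for this sketch.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my %line_of;    # element name => line of its start tag

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(Start => sub {
    my ($p, $el) = @_;
    $line_of{$el} = $p->current_line;

    # Application-level check; xpcroak adds the line number itself.
    $p->xpcroak("Element '$el' is not allowed") if $el eq 'bad';
});

my $xml = "<doc>\n<ok/>\n<bad/>\n</doc>\n";
my $ok  = eval { $parser->parse($xml); 1 };
my $err = $ok ? '' : $@;
$parser->release;

print "Rejected: $err" unless $ok;
```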
357
base([NEWBASE])
359 Returns the current value of the base for resolving relative URIs.
360 If NEWBASE is supplied, changes the base to that value.
361
362 context
Returns a list of element names that represent open elements, with
the last one being the innermost. Inside start and end tag
handlers, the last name in the list is the tag of the parent element.
366
367 current_element
368 Returns the name of the innermost currently opened element. Inside
369 start or end handlers, returns the parent of the element associated
370 with those tags.
371
372 in_element(NAME)
373 Returns true if NAME is equal to the name of the innermost
374 currently opened element. If namespace processing is being used and
375 you want to check against a name that may be in a namespace, then
376 use the generate_ns_name method to create the NAME argument.
377
378 within_element(NAME)
379 Returns the number of times the given name appears in the context
380 list. If namespace processing is being used and you want to check
381 against a name that may be in a namespace, then use the
382 generate_ns_name method to create the NAME argument.
383
384 depth
385 Returns the size of the context list.
386
387 element_index
388 Returns an integer that is the depth-first visit order of the
389 current element. This will be zero outside of the root element. For
390 example, this will return 1 when called from the start handler for
391 the root element start tag.
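
A small sketch of the context-related methods, called from a start handler: it records the element index, the depth, and a slash-separated path for each start tag. The document and the row format are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my @rows;

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(Start => sub {
    my ($p, $el) = @_;
    # Inside a start handler the context holds only the ancestors,
    # so the element itself is appended to form the full path.
    my $path = join '/', $p->context, $el;
    push @rows, join ' ', $p->element_index, $p->depth, $path;
});
$parser->parse('<a><b><c/></b><b/></a>');
$parser->release;

print "$_\n" for @rows;
```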
392
393 skip_until(INDEX)
394 INDEX is an integer that represents an element index. When this
395 method is called, all handlers are suspended until the start tag
396 for an element that has an index number equal to INDEX is seen. If
397 a start handler has been set, then this is the first tag that the
398 start handler will see after skip_until has been called.
399
400 position_in_context(LINES)
401 Returns a string that shows the current parse position. LINES
402 should be an integer >= 0 that represents the number of lines on
403 either side of the current parse line to place into the returned
404 string.
405
406 xml_escape(TEXT [, CHAR [, CHAR ...]])
407 Returns TEXT with markup characters turned into character entities.
408 Any additional characters provided as arguments are also turned
409 into character references where found in TEXT.
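
A short sketch of xml_escape, called here on a parser instance. The sample text is invented; note that, as described above, '>' is only escaped when it is passed as an extra argument.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my $parser = XML::Parser::Expat->new;

# Only the markup characters & and < are escaped by default.
my $plain = $parser->xml_escape('1 < 2 & 3 > 2');

# Name further characters to have them escaped as well.
my $with_gt = $parser->xml_escape('1 < 2 & 3 > 2', '>');

$parser->release;

print "$plain\n$with_gt\n";
```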
410
parse(SOURCE)
412 The SOURCE parameter should either be a string containing the whole
413 XML document, or it should be an open IO::Handle. Only a single
414 document may be parsed for a given instance of XML::Parser::Expat,
415 so this will croak if it's been called previously for this
416 instance.
417
418 parsestring(XML_DOC_STRING)
419 Parses the given string as an XML document. Only a single document
420 may be parsed for a given instance of XML::Parser::Expat, so this
421 will die if either parsestring or parsefile has been called for
422 this instance previously.
423
424 This method is deprecated in favor of the parse method.
425
426 parsefile(FILENAME)
427 Parses the XML document in the given file. Will die if parsestring
428 or parsefile has been called previously for this instance.
429
430 is_defaulted(ATTNAME)
431 NO LONGER WORKS. To find out if an attribute is defaulted please
432 use the specified_attr method.
433
434 specified_attr
435 When the start handler receives lists of attributes and values, the
436 non-defaulted (i.e. explicitly specified) attributes occur in the
437 list first. This method returns the number of specified items in
438 the list. So if this number is equal to the length of the list,
439 there were no defaulted values. Otherwise the number points to the
440 index of the first defaulted attribute name.
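
The split between specified and defaulted attributes could be made like this. The sketch assumes a made-up document whose internal DTD subset supplies a default for a "lang" attribute; %given and %defaulted are hypothetical names.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my $xml = <<'XML';
<!DOCTYPE doc [
  <!ATTLIST doc lang CDATA "en">
]>
<doc id="d1"/>
XML

my (%given, %defaulted);

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(Start => sub {
    my ($p, $el, @attlist) = @_;

    # Number of list items (names plus values) that were explicit.
    my $n = $p->specified_attr;

    %given     = @attlist[0 .. $n - 1];
    %defaulted = @attlist[$n .. $#attlist];
});
$parser->parse($xml);
$parser->release;

print "given: @{[ sort keys %given ]}\n";
print "defaulted: @{[ sort keys %defaulted ]}\n";
```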
441
442 finish
443 Unsets all handlers (including internal ones that set context), but
444 expat continues parsing to the end of the document or until it
445 finds an error. It should finish up a lot faster than with the
446 handlers set.
447
448 release
449 There are data structures used by XML::Parser::Expat that have
450 circular references. This means that these structures will never be
451 garbage collected unless these references are explicitly broken.
Calling this method breaks those references (and makes the instance
unusable).
454
455 Normally, higher level calls handle this for you, but if you are
456 using XML::Parser::Expat directly, then it's your responsibility to
457 call it.
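
One way to make sure release is always called is to wrap the parse in eval and release afterwards, whether or not the parse succeeded. The helper name well_formed is hypothetical; with no handlers registered, the parse only checks well-formedness.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

# Hypothetical helper: returns 1 if $xml is well formed, 0 otherwise.
sub well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser::Expat->new;

    # parse() dies on error, so trap it and remember the outcome.
    my $ok = eval { $parser->parse($xml); 1 };

    # Break the circular references on success and on failure alike.
    $parser->release;

    return $ok ? 1 : 0;
}

print well_formed('<a><b/></a>') ? "ok\n" : "broken\n";
print well_formed('<a><b></a>')  ? "ok\n" : "broken\n";
```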
458
459 XML::Parser::ContentModel Methods
460 The element declaration handlers are passed objects of this class as
461 the content model of the element declaration. They also represent
462 content particles, components of a content model.
463
When referred to as a string, these objects are automagically converted
to a string representation of the model (or content particle).
466
467 isempty
468 This method returns true if the object is "EMPTY", false otherwise.
469
470 isany
471 This method returns true if the object is "ANY", false otherwise.
472
473 ismixed
474 This method returns true if the object is "(#PCDATA)" or
475 "(#PCDATA|...)*", false otherwise.
476
477 isname
This method returns true if the object is an element name.
479
480 ischoice
481 This method returns true if the object is a choice of content
482 particles.
483
484 isseq
485 This method returns true if the object is a sequence of content
486 particles.
487
488 quant
489 This method returns undef or a string representing the quantifier
490 ('?', '*', '+') associated with the model or particle.
491
492 children
This method returns undef or (for mixed, choice, and sequence
types) an array of component content particles. There will always
be at least one component for choices and sequences, but for a
mixed content model of pure PCDATA, "(#PCDATA)", undef is
returned.
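
As an illustration, an Element handler can classify each declared content model with these methods. The DTD below is invented, and the sketch filters out undef before counting children, since children may return undef rather than an empty list.

```perl
use strict;
use warnings;
use XML::Parser::Expat;

my $xml = <<'XML';
<!DOCTYPE doc [
  <!ELEMENT doc (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA|em)*>
  <!ELEMENT em EMPTY>
]>
<doc><title>t</title><para>p</para></doc>
XML

my (%kind, %as_string, %child_count);

my $parser = XML::Parser::Expat->new;
$parser->setHandlers(Element => sub {
    my ($p, $name, $model) = @_;

    $kind{$name} = $model->isempty  ? 'empty'
                 : $model->isany    ? 'any'
                 : $model->ismixed  ? 'mixed'
                 : $model->ischoice ? 'choice'
                 : $model->isseq    ? 'seq'
                 :                    'name';

    # The object stringifies to its content model.
    $as_string{$name} = "$model";

    # Drop the undef that signals "no children".
    my @kids = grep { defined } $model->children;
    $child_count{$name} = scalar @kids;
});
$parser->parse($xml);
$parser->release;

print "$_: $kind{$_} $as_string{$_}\n" for sort keys %kind;
```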
498
499 XML::Parser::ExpatNB Methods
500 The class XML::Parser::ExpatNB is a subclass of XML::Parser::Expat used
501 for non-blocking access to the expat library. It does not support the
502 parse, parsestring, or parsefile methods, but it does have these
503 additional methods:
504
505 parse_more(DATA)
506 Feed expat more text to munch on.
507
508 parse_done
509 Tell expat that it's gotten the whole document.
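
A minimal sketch of non-blocking use, feeding a made-up document in arbitrary chunks such as might arrive from a socket. Constructing XML::Parser::ExpatNB directly with new is an assumption here (it inherits the constructor from XML::Parser::Expat); the chunk boundaries are chosen to fall in the middle of tags on purpose.

```perl
use strict;
use warnings;
use XML::Parser::Expat;    # also defines XML::Parser::ExpatNB

my @seen;

my $nb = XML::Parser::ExpatNB->new;
$nb->setHandlers(Start => sub { my ($p, $el) = @_; push @seen, $el });

# Chunks need not line up with tag boundaries.
for my $chunk ('<doc><it', 'em/><item', '/></doc>') {
    $nb->parse_more($chunk);
}
$nb->parse_done;    # signal end of input

print "saw: @seen\n";
```

Handlers fire as soon as enough input has arrived to recognize an event, so work can begin before the whole document is available.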
510
FUNCTIONS
XML::Parser::Expat::load_encoding(ENCODING)
513 Load an external encoding. ENCODING is either the name of an
514 encoding or the name of a file. The basename is converted to
515 lowercase and a '.enc' extension is appended unless there's one
516 already there. Then, unless it's an absolute pathname (i.e. begins
517 with '/'), the first file by that name discovered in the
518 @Encoding_Path path list is used.
519
520 The encoding in the file is loaded and kept in the %Encoding_Table
521 table. Earlier encodings of the same name are replaced.
522
523 This function is automatically called by expat when it encounters
524 an encoding it doesn't know about. Expat shouldn't call this twice
525 for the same encoding name. The only reason users should use this
526 function is to explicitly load an encoding not contained in the
527 @Encoding_Path list.
528
AUTHORS
Larry Wall <larry@wall.org> wrote version 1.0.
531
532 Clark Cooper <coopercc@netheaven.com> picked up support, changed the
533 API for this version (2.x), provided documentation, and added some
534 standard package features.
535

perl v5.28.1                      2015-01-12                      Expat(3)