SAX::Intro(3) User Contributed Perl Documentation SAX::Intro(3)

XML::SAX::Intro - An Introduction to SAX Parsing with Perl

XML::SAX is a new way to work with XML parsers in Perl. In this article
we'll discuss why you should be using SAX, why you should be using
XML::SAX, and we'll see some of the finer implementation details. The
text below assumes some familiarity with callback-based (push) parsing;
if you are unfamiliar with these techniques, a good place to start is
Kip Hampton's excellent series of articles on XML.com.

The de facto way of parsing XML under Perl is to use Larry Wall and
Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around
the expat XML parser library by James Clark. It has been a hugely
successful project, but it suffers from a couple of rather major flaws.
Firstly, it is a proprietary API, designed before the SAX API was
conceived, which means that it is not easily replaceable by other
streaming parsers. Secondly, its callbacks are subrefs. This doesn't
sound like much of an issue, but unfortunately it leads to code like:

    sub handle_start {
        my ($e, $el, %attrs) = @_;
        if ($el eq 'foo') {
            $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
        }
    }

As you can see, we're using the $e object to hold our state
information, which is a bad idea because we don't own that object - we
didn't create it. It's an internal object of XML::Parser that happens
to be a hashref. We could all too easily overwrite XML::Parser's
internal state variables by using this, or Clark could change it to an
array ref (not that he would, because it would break so much code, but
he could).

Currently, the only way to maintain state safely with XML::Parser is to
use a closure:

    my $state = MyState->new();
    $parser->setHandlers(Start => sub { handle_start($state, @_) });

This closure traps the $state variable, which now gets passed as the
first parameter to your callback. Unfortunately, very few people use
this technique, as it is not documented in the XML::Parser POD files.
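
The closure pattern itself can be tried out without XML::Parser
installed. What follows is only a minimal sketch: fire_start is a
hypothetical stand-in for the parser invoking its Start handler, and a
plain %state hash stands in for the MyState object above. Only the
closure idea is taken from the text.

```perl
use strict;
use warnings;

# Our own state, owned by us rather than by the parser.
my %state = (inside_foo => 0);

# The real callback: our state arrives as the first argument,
# followed by whatever the parser would normally pass.
sub handle_start {
    my ($state, $e, $el, %attrs) = @_;
    $state->{inside_foo}++ if $el eq 'foo';
}

# The closure captures \%state and prepends it to the arguments.
my $start_handler = sub { handle_start(\%state, @_) };

# Hypothetical stand-in for the parser firing Start events; with
# XML::Parser this would happen inside the parse call.
sub fire_start {
    my ($handler, @names) = @_;
    my $fake_expat = {};    # plays the role of $e; we never write to it
    $handler->($fake_expat, $_) for @names;
    return $fake_expat;
}

my $expat = fire_start($start_handler, qw(doc foo bar foo));
print "saw foo $state{inside_foo} times\n";
```

The object standing in for $e comes back untouched, because all the
bookkeeping happened in the hash that we created ourselves.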

Another reason you might not want to use XML::Parser is that you need
some feature that it doesn't provide (such as validation), or you might
need to use a library that doesn't depend on expat, because expat is
not installed on your system or because your ISP is restrictive. Using
SAX allows you to work around these restrictions.

SAX stands for the Simple API for XML. And simple it really is.
Constructing a SAX parser and passing events to handlers is as simple
as:

    use XML::SAX;
    use MySAXHandler;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => MySAXHandler->new
    );

    $parser->parse_uri("foo.xml");

The important concept to grasp here is that SAX uses a factory class
called XML::SAX::ParserFactory to create a new parser instance. The
reason for this is so that different underlying parser implementations,
with different feature sets, can be used behind the same interface.
This is one thing that XML::Parser has always sorely lacked.

In the code above we see the parse_uri method used, but we could
equally well have called parse_file, parse_string, or parse. Please see
XML::SAX::Base for what these methods take as parameters, but don't be
fooled into believing parse_file takes a filename. No, it takes a file
handle, a glob, or a subclass of IO::Handle. Beware.

SAX works very similarly to XML::Parser's default callback method,
except for one major difference: rather than setting individual
callbacks, you create a new class in which to receive the callbacks.
Each callback is called as a method call on an instance of that handler
class. An example will best demonstrate this:

    package MySAXHandler;
    use base qw(XML::SAX::Base);

    sub start_document {
        my ($self, $doc) = @_;
        # process document start event
    }

    sub start_element {
        my ($self, $el) = @_;
        # process element start event
    }

Now, when we instantiate this as above, and parse some XML with this as
the handler, the methods start_document and start_element will be
called as method calls, so this would be the equivalent of directly
calling:

    $object->start_element($el);

Notice how this is different from XML::Parser's calling style, which
calls:

    start_element($e, $name, %attribs);

It's this difference between function calling and method calling that
allows you to subclass SAX handlers, which contributes to SAX being a
powerful solution.

As you can see, unlike XML::Parser, we have to define a new package in
which to do our processing (there are hacks you can do to make this
unnecessary, but I'll leave figuring those out to the experts). The
biggest benefit of this is that you maintain your own state variable
($self in the above example), thus freeing you of the concerns listed
above. It is also an improvement in maintainability - you can place the
code in a separate file if you wish to, and your callback methods are
always called the same thing, rather than having to choose a suitable
name for them as you had to with XML::Parser. This is an obvious win.
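
To make the point about $self concrete, here is a minimal sketch of a
handler that keeps its own counters. The CountingHandler class, its
plain constructor and its field names are all made up for illustration,
and the events are fed in by hand rather than by a real parser, so that
the example runs with nothing but core Perl.

```perl
use strict;
use warnings;

package CountingHandler;

# A plain constructor; a real handler would more likely inherit
# new() from XML::SAX::Base, as MySAXHandler does above.
sub new {
    my ($class) = @_;
    return bless { depth => 0, max_depth => 0, count => {} }, $class;
}

# All state lives in $self, an object that we created and own.
sub start_element {
    my ($self, $el) = @_;
    $self->{count}{ $el->{Name} }++;
    $self->{depth}++;
    $self->{max_depth} = $self->{depth}
        if $self->{depth} > $self->{max_depth};
}

sub end_element {
    my ($self, $el) = @_;
    $self->{depth}--;
}

package main;

my $handler = CountingHandler->new;

# Hand-fed events for <doc><item/><item/></doc>.
$handler->start_element({ Name => 'doc' });
$handler->start_element({ Name => 'item' });
$handler->end_element({ Name => 'item' });
$handler->start_element({ Name => 'item' });
$handler->end_element({ Name => 'item' });
$handler->end_element({ Name => 'doc' });

printf "items: %d, max depth: %d\n",
    $handler->{count}{item}, $handler->{max_depth};
```

With a real parser the same object would simply be passed as the
Handler, and the parser would make these method calls for you.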

SAX parsers are also very flexible in how you pass a handler to them.
You can use a constructor parameter as we saw above, or you can pass
the handler directly in the call to one of the parse methods:

    $parser->parse(Handler => $handler,
                   Source => { SystemId => "foo.xml" });
    # or...
    $parser->parse_file($fh, Handler => $handler);

This flexibility allows one parser to be used in many different
scenarios throughout your script (though one shouldn't feel pressure to
use this method, as parser construction is generally not a
time-consuming process).

The only other thing you need to know to understand basic SAX is the
structure of the parameters passed to each of the callbacks. In
XML::Parser, all parameters are passed as a flat list of arguments to
the callbacks, so for example the Start callback would be called as
my_start($e, $name, %attributes), and the PI callback would be called
as my_processing_instruction($e, $target, $data). In SAX, every
callback is passed a hash reference, containing entries that define our
"node". The key callbacks and the structures they receive are:

start_element
    The start_element handler is called whenever a parser sees an
    opening tag. It is passed an element structure consisting of:

    LocalName
        The name of the element minus any namespace prefix it may have
        come with in the document.

    NamespaceURI
        The URI of the namespace associated with this element, or the
        empty string for none.

    Attributes
        A set of attributes as described below.

    Name
        The name of the element as it was seen in the document (i.e.
        including any prefix associated with it).

    Prefix
        The prefix used to qualify this element's namespace, or the
        empty string if none.

    The Attributes are a hash reference, keyed by what we have called
    "James Clark" notation. This means that the attribute name has been
    expanded to include any associated namespace URI, and put together
    as {ns}name, where "ns" is the expanded namespace URI of the
    attribute if and only if the attribute had a prefix, and "name" is
    the LocalName of the attribute.

    The value of each entry in the attributes hash is another hash
    structure consisting of:

    LocalName
        The name of the attribute minus any namespace prefix it may
        have come with in the document.

    NamespaceURI
        The URI of the namespace associated with this attribute. If the
        attribute had no prefix, then this consists of just the empty
        string.

    Name
        The attribute's name as it appeared in the document, including
        any namespace prefix.

    Prefix
        The prefix used to qualify this attribute's namespace, or the
        empty string if none.

    Value
        The value of the attribute.

    So a full example, as output by Data::Dumper, might be:

        ....
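
    As a rough illustration of the keying scheme described above, the
    sketch below builds an element structure by hand for a made-up tag
    and looks attributes up by their {ns}name keys. It is not real
    parser output: the tag, the namespace URI and the attr_value helper
    are all hypothetical, and I am assuming that an unprefixed
    attribute ends up keyed with an empty URI, i.e. as {}name.

```perl
use strict;
use warnings;

# Hand-built structure for the hypothetical tag
#   <doc id="42" x:lang="en" xmlns:x="http://example.org/ns"/>
# (the namespace declaration itself is left out for brevity).
my $el = {
    Name         => 'doc',
    LocalName    => 'doc',
    Prefix       => '',
    NamespaceURI => '',
    Attributes   => {
        '{}id' => {
            Name         => 'id',
            LocalName    => 'id',
            Prefix       => '',
            NamespaceURI => '',
            Value        => '42',
        },
        '{http://example.org/ns}lang' => {
            Name         => 'x:lang',
            LocalName    => 'lang',
            Prefix       => 'x',
            NamespaceURI => 'http://example.org/ns',
            Value        => 'en',
        },
    },
};

# Hypothetical helper: build the {ns}name key and return the Value,
# or undef if the element has no such attribute.
sub attr_value {
    my ($element, $ns, $local) = @_;
    my $attr = $element->{Attributes}{ '{' . $ns . '}' . $local };
    return defined $attr ? $attr->{Value} : undef;
}

print "id:   ", attr_value($el, '', 'id'), "\n";
print "lang: ", attr_value($el, 'http://example.org/ns', 'lang'), "\n";
```

    Looking up by namespace URI rather than by prefix means the code
    keeps working whatever prefix the document author happened to pick.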

end_element
    The end_element handler is called either when a parser sees a
    closing tag, or after start_element has been called for an empty
    element. Do note, however, that a parser may, if it is so inclined,
    call characters with an empty string when it sees an empty element.
    There is no simple way in SAX to determine whether the parser in
    fact saw an empty element or a start and end tag with no content.

    The end_element handler receives exactly the same structure as
    start_element, minus the Attributes entry. Note, though, that it
    should not be a reference to the same data that start_element
    received, so you may change the values in start_element without
    affecting the values later seen by end_element.

characters
    The characters callback may be called in several circumstances. The
    most obvious one is when seeing ordinary character data in the
    markup. But it is also called for text in a CDATA section, and in
    other situations too. A SAX parser makes no guarantees whatsoever
    about how many times it may call characters for a stretch of text
    in an XML document - it may call once, or it may call once for
    every character in the text. To work around this, it is often
    important for the SAX developer to use a bundling technique, where
    text is gathered up and processed in one of the other callbacks.
    This is not always necessary, but it is a worthwhile technique to
    learn, which we will cover in XML::SAX::Advanced (when I get around
    to writing it).
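
    Until then, here is one way the bundling idea might look. This is
    only a sketch under my own assumptions: the TitleCollector class
    and the "title" element are invented for the example, and the
    events are fed in by hand, deliberately splitting the text across
    several characters calls the way a parser is allowed to.

```perl
use strict;
use warnings;

package TitleCollector;

sub new {
    my ($class) = @_;
    return bless { buffer => '', titles => [] }, $class;
}

# Start with an empty buffer whenever a <title> opens.
sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '' if $el->{LocalName} eq 'title';
}

# Never process text here; just gather it up.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};
}

# Only when the element closes do we know the text is complete.
sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{titles} }, $self->{buffer}
        if $el->{LocalName} eq 'title';
}

package main;

my $collector = TitleCollector->new;

# Hand-fed events for <title>SAX for Perl</title>, with the text
# arriving in three separate chunks.
$collector->start_element({ LocalName => 'title' });
$collector->characters({ Data => 'SAX ' });
$collector->characters({ Data => 'for ' });
$collector->characters({ Data => 'Perl' });
$collector->end_element({ LocalName => 'title' });

print "title: $collector->{titles}[0]\n";
```

    The handler gives the same result whether the parser delivers the
    text in one call or in many, which is the whole point.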

    The characters handler is called with a very simple structure - a
    hash reference consisting of just one entry:

    Data
        The text data that was received.

comment
    The comment callback is called for comment text. Unlike with
    characters, the comment callback *must* be invoked just once for an
    entire comment string. It receives a single simple structure - a
    hash reference containing just one entry:

    Data
        The text of the comment.

processing_instruction
    The processing_instruction handler is called for all processing
    instructions in the document. Note that, according to the XML
    specification, processing instructions may appear before the
    document root element, after it, or anywhere within the document
    where text and elements would normally appear.

    The handler is passed a structure containing just two entries:

    Target
        The target of the processing instruction.

    Data
        The text data in the processing instruction. Can be an empty
        string for a processing instruction that has no data element.
        For example <?wiggle?> is a perfectly valid processing
        instruction.

What we have discussed above is really the tip of the SAX iceberg. And
so far it looks like there's not much of interest to SAX beyond what we
have seen with XML::Parser. But it does go much further than that, I
promise.

People who hate Object Oriented code for the sake of it may be thinking
here that creating a new package just to parse something is a waste
when they've been parsing things just fine up to now using procedural
code. But there's a reason for all this madness. And that reason is SAX
Filters.

As you saw right at the very start, to let the parser know about our
class, we pass it an instance of our class as the Handler to the
parser. But now imagine what would happen if our class could also take
a Handler option, do some processing, and simply pass our data on
further down the line. That in a nutshell is how SAX filters work. It's
Unix pipes for the 21st century!
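
The idea can be sketched in a few lines of plain Perl. Treat this as a
hand-rolled illustration rather than the recommended way to write a
filter: the UpcaseFilter and Recorder classes are invented for the
example, they forward events by calling the downstream handler
directly, and the events are once again fed in by hand.

```perl
use strict;
use warnings;

package UpcaseFilter;

# Like a parser, a filter is constructed with a downstream Handler.
sub new {
    my ($class, %args) = @_;
    return bless { Handler => $args{Handler} }, $class;
}

# Do some processing (upper-case the name), then pass the event on.
# A copy is passed so the incoming structure is left untouched.
sub start_element {
    my ($self, $el) = @_;
    $self->{Handler}->start_element({ %$el, Name => uc $el->{Name} });
}

sub end_element {
    my ($self, $el) = @_;
    $self->{Handler}->end_element({ %$el, Name => uc $el->{Name} });
}

# Events we have no interest in are forwarded unchanged.
sub characters {
    my ($self, $chars) = @_;
    $self->{Handler}->characters($chars);
}

package Recorder;

# End of the line: writes what it receives into a string.
sub new {
    my ($class) = @_;
    return bless { out => '' }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{out} .= '<' . $el->{Name} . '>';
}

sub end_element {
    my ($self, $el) = @_;
    $self->{out} .= '</' . $el->{Name} . '>';
}

sub characters {
    my ($self, $chars) = @_;
    $self->{out} .= $chars->{Data};
}

package main;

my $recorder = Recorder->new;
my $filter   = UpcaseFilter->new(Handler => $recorder);

# Hand-fed events for <doc>hi</doc>.
$filter->start_element({ Name => 'doc' });
$filter->characters({ Data => 'hi' });
$filter->end_element({ Name => 'doc' });

print $recorder->{out}, "\n";
```

Because the filter looks like a handler from above and like a parser
from below, any number of them can be stacked between a real parser and
the final handler.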

There are two downsides to this. Firstly, writing SAX filters can be
tricky. If you look into the future and read the advanced tutorial I'm
writing, you'll see that a Handler can come in several shapes and
sizes, so making sure your filter does the right thing can be tricky.
Secondly, constructing complex filter chains can be difficult, and
simple thinking tells us that we only get one pass at our document,
when often we'll need more than that.

Luckily though, those downsides have been fixed by the release of two
very cool modules. What's even better is that I didn't write either of
them!

The first module is XML::SAX::Base. This is a VITAL SAX module that
acts as a base class for all SAX parsers and filters. It provides an
abstraction away from calling the handler methods that makes sure your
filter or parser does the right thing, and it does it FAST. So if you
ever need to write a SAX filter (and if you're processing XML -> XML or
XML -> HTML, you probably do), you need to write it as a subclass of
XML::SAX::Base. Really - this is not advice to ignore lightly. I will
not go into the details of writing a SAX filter here. Kip Hampton, the
author of XML::SAX::Base, has covered this nicely in his article on
XML.com here <URI>.

To construct SAX pipelines, Barrie Slaymaker, a long-time Perl hacker
whose modules you will probably have heard of or used, wrote a very
clever module called XML::SAX::Machines. This combines some really
clever SAX filter-type modules with a construction toolkit for filters
that makes building pipelines easy. But before we see how it makes
things easy, let's first see how tricky it looks to build complex SAX
filter pipelines by hand.

    use XML::SAX::ParserFactory;
    use XML::SAX::Filter1;    # Filter1 and Filter2 are placeholder names
    use XML::SAX::Filter2;
    use XML::SAX::Writer;

    my $output_string;
    # Build the chain back to front: each stage needs its downstream handler.
    my $writer  = XML::SAX::Writer->new(Output => \$output_string);
    my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
    my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $filter1);

    $parser->parse_uri("foo.xml");

This is a lot easier with XML::SAX::Machines:

    use XML::SAX::Machines qw(Pipeline);

    my $output_string;
    my $parser = Pipeline(
        "XML::SAX::Filter1" => "XML::SAX::Filter2" => \$output_string
    );

    $parser->parse_uri("foo.xml");

One of the main benefits of XML::SAX::Machines is that the pipelines
are constructed in natural order, rather than the reverse order we saw
with manual pipeline construction. XML::SAX::Machines takes care of all
the internals of pipe construction, providing you at the end with just
a parser you can use (and you can re-use the same parser as many times
as you need to).

Just a final tip. If you ever get stuck and are confused about what is
being passed from one SAX filter or parser to the next, then
Devel::TraceSAX will come to your rescue. This Perl debugger plugin
will allow you to dump the SAX stream of events as it goes by. Usage is
really very simple: just call your Perl script that uses SAX as
follows:

    $ perl -d:TraceSAX <scriptname>

Preferably, pipe the output to a pager of some sort, such as more or
less. The output is extremely verbose, but should help clear some
issues up.

Matt Sergeant, matt@sergeant.org

$Id$

perl v5.16.3 2009-10-10 SAX::Intro(3)