XML::SAX::Intro(3pm)

SAX::Intro(3)         User Contributed Perl Documentation        SAX::Intro(3)

NAME

       XML::SAX::Intro - An Introduction to SAX Parsing with Perl

Introduction

       XML::SAX is a new way to work with XML parsers in Perl. In this article
       we'll discuss why you should be using SAX, why you should be using
       XML::SAX, and we'll see some of the finer implementation details. The
       text below assumes some familiarity with callback, or push-based,
       parsing, but if you are unfamiliar with these techniques then a good
       place to start is Kip Hampton's excellent series of articles on XML.com.

Replacing XML::Parser

       The de facto way of parsing XML in Perl is to use Larry Wall and Clark
       Cooper's XML::Parser. This module is a Perl and XS wrapper around the
       expat XML parser library by James Clark. It has been a hugely
       successful project, but it suffers from a couple of rather major flaws.
       Firstly, it is a proprietary API, designed before the SAX API was
       conceived, which means that it is not easily replaceable by other
       streaming parsers. Secondly, its callbacks are subrefs. This doesn't
       sound like much of an issue, but unfortunately it leads to code like:

         sub handle_start {
           my ($e, $el, %attrs) = @_;
           if ($el eq 'foo') {
             $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object.
           }
         }

       As you can see, we're using the $e object to hold our state
       information, which is a bad idea because we don't own that object - we
       didn't create it. It's an internal object of XML::Parser that happens to
       be a hashref. We could all too easily overwrite XML::Parser's internal
       state variables by using this, or Clark could change it to an array ref
       (not that he would, because it would break so much code, but he could).

       The only way currently to maintain state safely with XML::Parser is to
       use a closure:

         my $state = MyState->new();
         $parser->setHandlers(Start => sub { handle_start($state, @_) });

       This closure traps the $state variable, which now gets passed as the
       first parameter to your callback. Unfortunately very few people use
       this technique, as it is not documented in the XML::Parser POD files.

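       To make the closure technique concrete, here is a minimal sketch that
       runs without XML::Parser installed. The MyState class, its inside_foo
       field and the hand-made calls are assumptions for illustration only;
       the callback is invoked by hand with a stand-in for the expat object,
       in the same argument order XML::Parser uses for a Start handler.

```perl
use strict;
use warnings;

# Hypothetical state class; any object that you own will do.
{
    package MyState;
    sub new { my $class = shift; return bless { inside_foo => 0 }, $class }
}

# The callback receives our own state object first, followed by the usual
# XML::Parser Start arguments (expat object, element name, attributes).
sub handle_start {
    my ($state, $e, $el, %attrs) = @_;
    $state->{inside_foo}++ if $el eq 'foo';    # our object, not $e
}

my $state = MyState->new;
my $start = sub { handle_start($state, @_) };  # the closure traps $state

# With a real parser this would be: $parser->setHandlers(Start => $start);
# Here two Start events are simulated by hand instead.
my $fake_expat = {};    # stand-in for the XML::Parser::Expat object
$start->($fake_expat, 'foo', id => 1);
$start->($fake_expat, 'bar');

print "inside_foo = $state->{inside_foo}\n";
```

       The point of the design is that the parser's own object is never
       written to: all bookkeeping lives in $state.
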
       Another reason you might not want to use XML::Parser is that you need
       some feature it doesn't provide (such as validation), or you might need
       a library that doesn't use expat, because expat is not installed on
       your system or because your ISP is restrictive. Using SAX allows you to
       work around these restrictions.

Introducing SAX

       SAX stands for the Simple API for XML. And simple it really is.
       Constructing a SAX parser and passing events to handlers is done as
       simply as:

         use XML::SAX;
         use MySAXHandler;

         my $parser = XML::SAX::ParserFactory->parser(
               Handler => MySAXHandler->new
         );

         $parser->parse_uri("foo.xml");

       The important concept to grasp here is that SAX uses a factory class
       called XML::SAX::ParserFactory to create a new parser instance. The
       reason for this is so that you can support other underlying parser
       implementations for different feature sets. This is one thing that
       XML::Parser has always sorely lacked.

       In the code above we see the parse_uri method used, but we could
       equally well have called parse_file, parse_string, or parse(). Please
       see XML::SAX::Base for what these methods take as parameters, but don't
       be fooled into believing parse_file takes a filename. No, it takes a
       file handle, a glob, or a subclass of IO::Handle. Beware.

       SAX works very similarly to XML::Parser's default callback method,
       with one major difference: rather than setting individual callbacks,
       you create a new class in which to receive the callbacks. Each callback
       is called as a method call on an instance of that handler class. An
       example will best demonstrate this:

         package MySAXHandler;
         use base qw(XML::SAX::Base);

         sub start_document {
           my ($self, $doc) = @_;
           # process document start event
         }

         sub start_element {
           my ($self, $el) = @_;
           # process element start event
         }

       Now, when we instantiate this as above, and parse some XML with this as
       the handler, the methods start_document and start_element will be
       called as method calls, so this would be the equivalent of directly
       calling:

         $object->start_element($el);

       Notice how this is different from XML::Parser's calling style, which
       calls:

         start_element($e, $name, %attribs);

       It's this difference between function calling and method calling that
       allows you to subclass SAX handlers, which contributes to SAX being a
       powerful solution.

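       As a rough sketch of why method calls matter, the following shows one
       handler subclassing another. Both class names are hypothetical, no
       XML::SAX modules are needed to run it, and the events are fed in by
       hand the way a parser would deliver them.

```perl
use strict;
use warnings;

# Hypothetical base handler: counts every element it is told about.
{
    package CountingHandler;
    sub new { my $class = shift; return bless { count => 0 }, $class }
    sub start_element {
        my ($self, $el) = @_;
        $self->{count}++;
    }
}

# Subclass: also remembers element names, then defers to the parent method.
{
    package NamedCountingHandler;
    our @ISA = ('CountingHandler');
    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{names} }, $el->{Name};
        $self->SUPER::start_element($el);
    }
}

my $handler = NamedCountingHandler->new;

# Simulate what a SAX parser does: a method call on the handler object.
$handler->start_element({ Name => $_ }) for qw(doc foo);

print "$handler->{count} elements: @{ $handler->{names} }\n";
```

       With plain function callbacks there is no equivalent of SUPER::, so
       extending someone else's handler means copying or wrapping it by hand.
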
       As you can see, unlike XML::Parser, we have to define a new package in
       which to do our processing (there are hacks you can do to make this
       unnecessary, but I'll leave figuring those out to the experts). The
       biggest benefit of this is that you maintain your own state variable
       ($self in the above example), thus freeing you of the concerns listed
       above. It is also an improvement in maintainability - you can place the
       code in a separate file if you wish to, and your callback methods are
       always called the same thing, rather than having to choose a suitable
       name for them as you had to with XML::Parser. This is an obvious win.

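       Putting the pieces so far together, here is a small sketch of a handler
       that keeps its state in $self. It assumes XML::SAX is installed; the
       FooCounter class, its foo_count field and the sample documents are
       made up for the example. It also shows parse_file being given an open
       filehandle rather than a filename.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::SAX::ParserFactory;

# Hypothetical handler: all state lives in $self, which we own.
{
    package FooCounter;
    use base qw(XML::SAX::Base);

    sub start_document {
        my ($self, $doc) = @_;
        $self->{foo_count} = 0;          # reset state for each parse
    }

    sub start_element {
        my ($self, $el) = @_;
        $self->{foo_count}++ if $el->{LocalName} eq 'foo';
    }
}

my $handler = FooCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# parse_string takes the document text directly.
$parser->parse_string('<doc><foo/><bar/><foo>text</foo></doc>');
my $from_string = $handler->{foo_count};

# parse_file takes an open filehandle, not a filename.
my ($out, $filename) = tempfile(UNLINK => 1);
print {$out} '<doc><foo/></doc>';
close $out;
open my $in, '<', $filename or die "Cannot open $filename: $!";
$parser->parse_file($in);
close $in;
my $from_file = $handler->{foo_count};

print "string: $from_string, file: $from_file\n";
```

       Resetting the counter in start_document is a design choice that lets
       the same handler object be reused across several parses.
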
       SAX parsers are also very flexible in how you pass a handler to them.
       You can use a constructor parameter as we saw above, or you can pass
       the handler directly in the call to one of the parse methods:

         $parser->parse(Handler => $handler,
                        Source => { SystemId => "foo.xml" });
         # or...
         $parser->parse_file($fh, Handler => $handler);

       This flexibility allows for one parser to be used in many different
       scenarios throughout your script (though one shouldn't feel pressure to
       use this method, as parser construction is generally not a
       time-consuming process).

Callback Parameters

       The only other thing you need to know to understand basic SAX is the
       structure of the parameters passed to each of the callbacks. In
       XML::Parser, all parameters are passed as multiple options to the
       callbacks, so for example the Start callback would be called as
       my_start($e, $name, %attributes), and the PI callback would be called
       as my_processing_instruction($e, $target, $data). In SAX, every
       callback is passed a hash reference, containing entries that define our
       "node". The key callbacks and the structures they receive are:

       start_element

       The start_element handler is called whenever a parser sees an opening
       tag. It is passed an element structure consisting of:

       LocalName
           The name of the element minus any namespace prefix it may have come
           with in the document.

       NamespaceURI
           The URI of the namespace associated with this element, or the empty
           string for none.

       Attributes
           A set of attributes as described below.

       Name
           The name of the element as it was seen in the document (i.e.
           including any prefix associated with it).

       Prefix
           The prefix used to qualify this element's namespace, or the empty
           string if none.

       The Attributes are a hash reference, keyed by what we have called
       "James Clark" notation. This means that the attribute name has been
       expanded to include any associated namespace URI, and put together as
       {ns}name, where "ns" is the expanded namespace URI of the attribute if
       and only if the attribute had a prefix, and "name" is the LocalName of
       the attribute.

       The value of each entry in the attributes hash is another hash
       structure consisting of:

       LocalName
           The name of the attribute minus any namespace prefix it may have
           come with in the document.

       NamespaceURI
           The URI of the namespace associated with this attribute. If the
           attribute had no prefix, then this consists of just the empty
           string.

       Name
           The attribute's name as it appeared in the document, including any
           namespace prefix.

       Prefix
           The prefix used to qualify this attribute's namespace, or the empty
           string if none.

       Value
           The value of the attribute.

       So a full example, as output by Data::Dumper, might be:

         ....

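       The dump itself is elided above. Purely as an illustration, and not as
       the original example, here is a hand-built structure of the shape just
       described, for a hypothetical element <x:para x:lang="en" id="p1"> whose
       prefix x is bound to an invented namespace URI. Nothing here comes from
       a real parser; it only shows how the {ns}name keys are used.

```perl
use strict;
use warnings;
use Data::Dumper;

my $ns = 'http://example.com/ns';    # invented namespace URI

# Hand-built stand-in for what start_element might receive.
my $el = {
    Name         => 'x:para',
    LocalName    => 'para',
    Prefix       => 'x',
    NamespaceURI => $ns,
    Attributes   => {
        # Prefixed attribute: the key is {namespace-uri}localname.
        '{' . $ns . '}lang' => {
            Name         => 'x:lang',
            LocalName    => 'lang',
            Prefix       => 'x',
            NamespaceURI => $ns,
            Value        => 'en',
        },
        # Unprefixed attribute: the namespace part of the key is empty.
        '{}id' => {
            Name         => 'id',
            LocalName    => 'id',
            Prefix       => '',
            NamespaceURI => '',
            Value        => 'p1',
        },
    },
};

# Look attributes up by their James Clark notation keys.
my $lang = $el->{Attributes}{ '{' . $ns . '}lang' }{Value};
my $id   = $el->{Attributes}{'{}id'}{Value};
print "lang=$lang id=$id\n";

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper($el);
```
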
       end_element

       The end_element handler is called either when a parser sees a closing
       tag, or after start_element has been called for an empty element (do
       note however that a parser may, if it is so inclined, call characters
       with an empty string when it sees an empty element). There is no simple
       way in SAX to determine whether the parser in fact saw an empty element
       or a start and end element with no content.

       The end_element handler receives exactly the same structure as
       start_element, minus the Attributes entry. One must note though that it
       should not be a reference to the same data as start_element receives,
       so you may change the values in start_element but this will not affect
       the values later seen by end_element.

       characters

       The characters callback may be called in several circumstances. The
       most obvious one is when seeing ordinary character data in the markup.
       But it is also called for text in a CDATA section, and is also called
       in other situations. A SAX parser makes no guarantees whatsoever about
       how many times it may call characters for a stretch of text in an XML
       document - it may call once, or it may call once for every character in
       the text. In order to work around this it is often important for the
       SAX developer to use a bundling technique, where text is gathered up
       and processed in one of the other callbacks. This is not always
       necessary, but it is a worthwhile technique to learn, which we will
       cover in XML::SAX::Advanced (when I get around to writing it).

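       Until then, here is one possible sketch of the bundling technique, not
       necessarily the one the advanced tutorial will present. The
       TextCollector class is hypothetical and the events are fed in by hand,
       with the text deliberately split into three chunks to mimic a parser
       that calls characters more than once.

```perl
use strict;
use warnings;

# Hypothetical handler that gathers text and processes it in end_element.
{
    package TextCollector;
    sub new {
        my $class = shift;
        return bless { buffer => '', seen => [] }, $class;
    }

    sub start_element {
        my ($self, $el) = @_;
        $self->{buffer} = '';                  # start a fresh buffer
    }

    sub characters {
        my ($self, $chars) = @_;
        $self->{buffer} .= $chars->{Data};     # only accumulate here
    }

    sub end_element {
        my ($self, $el) = @_;
        # The complete text is only known once the element closes.
        push @{ $self->{seen} }, "$el->{Name}=$self->{buffer}";
        $self->{buffer} = '';
    }
}

my $h = TextCollector->new;

# Simulate a parser that splits "hello world" into three chunks.
$h->start_element({ Name => 'title' });
$h->characters({ Data => $_ }) for 'hel', 'lo wor', 'ld';
$h->end_element({ Name => 'title' });

print "$h->{seen}[0]\n";
```

       This simple version only suits elements that contain nothing but text;
       mixed content would need a stack of buffers rather than a single one.
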
       The characters handler is called with a very simple structure - a hash
       reference consisting of just one entry:

       Data
           The text data that was received.

       comment

       The comment callback is called for comment text. Unlike with
       "characters()", the comment callback *must* be invoked just once for an
       entire comment string. It receives a single simple structure - a hash
       reference containing just one entry:

       Data
           The text of the comment.

       processing_instruction

       The processing instruction handler is called for all processing
       instructions in the document. Note that these processing instructions
       may appear before the document root element, or after it, or anywhere
       where text and elements would normally appear within the document,
       according to the XML specification.

       The handler is passed a structure containing just two entries:

       Target
           The target of the processing instruction.

       Data
           The text data in the processing instruction. Can be an empty string
           for a processing instruction that has no data element.  For example
           <?wiggle?> is a perfectly valid processing instruction.

Tip of the iceberg

       What we have discussed above is really the tip of the SAX iceberg. And
       so far it looks like there's not much of interest to SAX beyond what we
       have seen with XML::Parser. But it does go much further than that, I
       promise.

       People who hate Object Oriented code for the sake of it may be thinking
       here that creating a new package just to parse something is a waste
       when they've been parsing things just fine up to now using procedural
       code. But there's reason to all this madness. And that reason is SAX
       Filters.

       As you saw right at the very start, to let the parser know about our
       class, we pass it an instance of our class as the Handler to the
       parser. But now imagine what would happen if our class could also take
       a Handler option, and simply do some processing and pass on our data
       further down the line? That in a nutshell is how SAX filters work. It's
       Unix pipes for the 21st century!

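       To show the idea in its barest form, here is a hand-rolled sketch of a
       filter: a handler that itself takes a Handler option, does a little
       work and passes the event on. Both classes are hypothetical, no
       XML::SAX modules are used, and the events are simulated by hand. A
       real filter has more cases to cover than this.

```perl
use strict;
use warnings;

# Hypothetical filter: upper-cases element names, then passes the event on.
{
    package UpcaseFilter;
    sub new {
        my ($class, %opt) = @_;
        return bless { Handler => $opt{Handler} }, $class;
    }
    sub start_element {
        my ($self, $el) = @_;
        # Work on a copy so the caller's structure is left untouched.
        my %copy = (%$el, Name => uc($el->{Name}));
        $self->{Handler}->start_element(\%copy);
    }
}

# Hypothetical end-of-chain handler that records what reaches it.
{
    package Recorder;
    sub new { my $class = shift; return bless { names => [] }, $class }
    sub start_element {
        my ($self, $el) = @_;
        push @{ $self->{names} }, $el->{Name};
    }
}

my $recorder = Recorder->new;
my $filter   = UpcaseFilter->new(Handler => $recorder);

# A parser would be given $filter as its Handler; simulate its events here.
$filter->start_element({ Name => $_ }) for qw(doc foo);

print "@{ $recorder->{names} }\n";
```

       The filter neither knows nor cares whether the thing downstream is a
       final handler or another filter, which is what makes chaining possible.
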
       There are two downsides to this. Firstly, writing SAX filters can be
       tricky. If you look into the future and read the advanced tutorial I'm
       writing, you'll see that Handler can come in several shapes and sizes,
       so making sure your filter does the right thing can be tricky.
       Secondly, constructing complex filter chains can be difficult, and
       simple thinking tells us that we only get one pass at our document,
       when often we'll need more than that.

       Luckily though, those downsides have been fixed by the release of two
       very cool modules. What's even better is that I didn't write either of
       them!

       The first module is XML::SAX::Base. This is a VITAL SAX module that
       acts as a base class for all SAX parsers and filters. It provides an
       abstraction away from calling the handler methods, that makes sure your
       filter or parser does the right thing, and it does it FAST. So, if you
       ever need to write a SAX filter (and if you're processing XML -> XML or
       XML -> HTML, you probably do), then you need to write it as a subclass
       of XML::SAX::Base. Really - this is not advice to ignore lightly. I
       will not go into the details of writing a SAX filter here. Kip Hampton,
       the author of XML::SAX::Base, has covered this nicely in his article on
       XML.com here <URI>.

       To construct SAX pipelines, Barrie Slaymaker, a long-time Perl hacker
       whose modules you will probably have heard of or used, wrote a very
       clever module called XML::SAX::Machines. This combines some really
       clever SAX filter-type modules with a construction toolkit for filters
       that makes building pipelines easy. But before we see how it makes
       things easy, first let's see how tricky it looks to build complex SAX
       filter pipelines.

         use XML::SAX::ParserFactory;
         use XML::SAX::Filter1;
         use XML::SAX::Filter2;
         use XML::SAX::Writer;

         # Build the chain back to front: each stage needs its Handler
         # to exist before it can be constructed.
         my $output_string;
         my $writer = XML::SAX::Writer->new(Output => \$output_string);
         my $filter2 = XML::SAX::Filter2->new(Handler => $writer);
         my $filter1 = XML::SAX::Filter1->new(Handler => $filter2);
         my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1);

         $parser->parse_uri("foo.xml");

       This is a lot easier with XML::SAX::Machines:

         use XML::SAX::Machines qw(Pipeline);

         my $output_string;
         my $parser = Pipeline(
               "XML::SAX::Filter1" => "XML::SAX::Filter2" => \$output_string
               );

         $parser->parse_uri("foo.xml");

       One of the main benefits of XML::SAX::Machines is that the pipelines
       are constructed in natural order, rather than the reverse order we saw
       with manual pipeline construction. XML::SAX::Machines takes care of all
       the internals of pipe construction, providing you at the end with just
       a parser you can use (and you can re-use the same parser as many times
       as you need to).

       Just a final tip. If you ever get stuck and are confused about what is
       being passed from one SAX filter or parser to the next, then
       Devel::TraceSAX will come to your rescue. This Perl debugger plugin
       will allow you to dump the SAX stream of events as it goes by. Usage is
       really very simple: just call your Perl script that uses SAX as
       follows:

         $ perl -d:TraceSAX <scriptname>

       And preferably pipe the output to a pager of some sort, such as more or
       less. The output is extremely verbose, but should help clear some
       issues up.

AUTHOR

       Matt Sergeant, matt@sergeant.org

       $Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $

perl v5.8.8                       2005-10-14                     SAX::Intro(3)