Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME

       HTML::Parser - HTML parser class

SYNOPSIS

        use HTML::Parser ();

        # Create parser object
        $p = HTML::Parser->new( api_version => 3,
                                start_h => [\&start, "tagname, attr"],
                                end_h   => [\&end,   "tagname"],
                                marked_sections => 1,
                              );

        # Parse document text chunk by chunk
        $p->parse($chunk1);
        $p->parse($chunk2);
        #...
        $p->eof;                 # signal end of document

        # Parse directly from file
        $p->parse_file("foo.html");
        # or
        open(my $fh, "<:utf8", "foo.html") || die "Can't open foo.html: $!";
        $p->parse_file($fh);

DESCRIPTION

31       Objects of the "HTML::Parser" class will recognize markup and separate
32       it from plain text (alias data content) in HTML documents.  As
33       different kinds of markup and text are recognized, the corresponding
34       event handlers are invoked.
35
36       "HTML::Parser" is not a generic SGML parser.  We have tried to make it
37       able to deal with the HTML that is actually "out there", and it
38       normally parses as closely as possible to the way the popular web
39       browsers do it instead of strictly following one of the many HTML
40       specifications from W3C.  Where there is disagreement, there is often
41       an option that you can enable to get the official behaviour.
42
43       The document to be parsed may be supplied in arbitrary chunks.  This
44       makes on-the-fly parsing as documents are received from the network
45       possible.
46
47       If event driven parsing does not feel right for your application, you
48       might want to use "HTML::PullParser".  This is an "HTML::Parser"
49       subclass that allows a more conventional program structure.
50

METHODS

52       The following method is used to construct a new "HTML::Parser" object:
53
54       $p = HTML::Parser->new( %options_and_handlers )
55           This class method creates a new "HTML::Parser" object and returns
56           it.  Key/value argument pairs may be provided to assign event
57           handlers or initialize parser options.  The handlers and parser
58           options can also be set or modified later by the method calls
59           described below.
60
61           If a top level key is in the form "<event>_h" (e.g., "text_h") then
62           it assigns a handler to that event, otherwise it initializes a
63           parser option. The event handler specification value must be an
64           array reference.  Multiple handlers may also be assigned with the
65           'handlers => [%handlers]' option.  See examples below.
66
67           If new() is called without any arguments, it will create a parser
68           that uses callback methods compatible with version 2 of
69           "HTML::Parser".  See the section on "version 2 compatibility" below
70           for details.
71
72           The special constructor option 'api_version => 2' can be used to
73           initialize version 2 callbacks while still setting other options
74           and handlers.  The 'api_version => 3' option can be used if you
75           don't want to set any options and don't want to fall back to v2
76           compatible mode.
77
78           Examples:
79
80            $p = HTML::Parser->new(api_version => 3,
81                                   text_h => [ sub {...}, "dtext" ]);
82
83           This creates a new parser object with a text event handler
84           subroutine that receives the original text with general entities
85           decoded.
86
87            $p = HTML::Parser->new(api_version => 3,
88                                   start_h => [ 'my_start', "self,tokens" ]);
89
90           This creates a new parser object with a start event handler method
91           that receives the $p and the tokens array.
92
93            $p = HTML::Parser->new(api_version => 3,
94                                   handlers => { text => [\@array, "event,text"],
95                                                 comment => [\@array, "event,text"],
96                                               });
97
98           This creates a new parser object that stores the event type and the
99           original text in @array for text and comment events.
100
101       The following methods feed the HTML document to the "HTML::Parser"
102       object:
103
104       $p->parse( $string )
105           Parse $string as the next chunk of the HTML document.  Handlers
106           invoked should not attempt to modify the $string in-place until
107           $p->parse returns.
108
109           If an invoked event handler aborts parsing by calling $p->eof, then
110           $p->parse() will return a FALSE value.  Otherwise the return value
111           is a reference to the parser object ($p).
112
113       $p->parse( $code_ref )
114           If a code reference is passed as the argument to be parsed, then
115           the chunks to be parsed are obtained by invoking this function
116           repeatedly.  Parsing continues until the function returns an empty
117           (or undefined) result.  When this happens $p->eof is automatically
118           signaled.
119
120           Parsing will also abort if one of the event handlers calls $p->eof.
121
122           The effect of this is the same as:
123
124            while (1) {
125               my $chunk = &$code_ref();
126               if (!defined($chunk) || !length($chunk)) {
127                   $p->eof;
128                   return $p;
129               }
130               $p->parse($chunk) || return undef;
131            }
132
133           But it is more efficient as this loop runs internally in XS code.
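
           A minimal sketch of the code-reference form.  The @chunks array
           is a hypothetical stand-in for data arriving from a file or
           socket; it is not part of the module:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical input, split at arbitrary points (even mid-word).
my @chunks = ("<p>Hello, <b>wor", "ld</b>!", "</p>");

my @tags;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @tags, shift }, "tagname" ],
);

# Each call hands back the next chunk; the undef returned once @chunks
# is empty ends the parse, and eof is then signaled automatically.
$p->parse(sub { shift @chunks });

print join(",", @tags), "\n";
```

           No explicit $p->eof call is needed here, since the parser
           signals it when the callback runs dry.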
134
135       $p->parse_file( $file )
136           Parse text directly from a file.  The $file argument can be a
137           filename, an open file handle, or a reference to an open file
138           handle.
139
140           If $file contains a filename and the file can't be opened, then the
141           method returns an undefined value and $! tells why it failed.
142           Otherwise the return value is a reference to the parser object.
143
144           If a file handle is passed as the $file argument, then the file
145           will normally be read until EOF, but not closed.
146
147           If an invoked event handler aborts parsing by calling $p->eof, then
148           $p->parse_file() may not have read the entire file.
149
150           On systems with multi-byte line terminators, the values passed for
151           the offset and length argspecs may be too low if parse_file() is
152           called on a file handle that is not in binary mode.
153
154           If a filename is passed in, then parse_file() will open the file in
155           binary mode.
156
157       $p->eof
158           Signals the end of the HTML document.  Calling the $p->eof method
159           outside a handler callback will flush any remaining buffered text
160           (which triggers the "text" event if there is any remaining text).
161
162           Calling $p->eof inside a handler will terminate parsing at that
163           point and cause $p->parse to return a FALSE value.  This also
164           terminates parsing by $p->parse_file().
165
           After $p->eof has been called, the parse() and parse_file() methods
           can be invoked again to feed new documents to the same parser object.
168
169           The return value from eof() is a reference to the parser object.
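
           As an illustration of aborting from inside a handler, the
           following sketch stops parsing as soon as the document title
           has been read.  The sample document and the $title variable
           are made up for the example:

```perl
use strict;
use warnings;
use HTML::Parser ();

my $title = "";
my $p = HTML::Parser->new(api_version => 3);

# Wait for <title>, then collect its text and abort at </title>.
$p->handler(start => sub {
    my ($tagname, $self) = @_;
    return unless $tagname eq "title";
    $self->handler(text => sub { $title .= shift }, "dtext");
    $self->handler(end  => sub { shift->eof if shift eq "title" },
                   "tagname,self");
}, "tagname,self");

# parse() returns a false value because a handler called eof.
my $ret = $p->parse(
    "<html><head><title>Example page</title></head><body>never seen</body>"
);

print "title: $title\n";
print "parse was aborted\n" unless $ret;
```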
170
171       Most parser options are controlled by boolean attributes.  Each boolean
172       attribute is enabled by calling the corresponding method with a TRUE
173       argument and disabled with a FALSE argument.  The attribute value is
174       left unchanged if no argument is given.  The return value from each
175       method is the old attribute value.
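
       The get/set convention can be sketched as follows, using
       "case_sensitive" as the example attribute.  The variable names are
       illustrative only:

```perl
use strict;
use warnings;
use HTML::Parser ();

my @names;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @names, shift }, "tagname" ],
);

# Default behaviour: tag names are down-cased.
$p->parse("<IMG>");
$p->eof;

my $old = $p->case_sensitive(1);   # enable; returns the previous value
my $now = $p->case_sensitive;      # no argument: query without changing

# Same input again, now reported as found in the source.
$p->parse("<IMG>");
$p->eof;

print "old=", ($old ? 1 : 0), " now=", ($now ? 1 : 0), " names=@names\n";
```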
176
177       Methods that can be used to get and/or set parser options are:
178
179       $p->attr_encoded
180       $p->attr_encoded( $bool )
181           By default, the "attr" and @attr argspecs will have general
182           entities for attribute values decoded.  Enabling this attribute
183           leaves entities alone.
184
185       $p->backquote
186       $p->backquote( $bool )
187           By default, only ' and " are recognized as quote characters around
188           attribute values.  MSIE also recognizes backquotes for some reason.
189           Enabling this attribute provides compatibility with this behaviour.
190
191       $p->boolean_attribute_value( $val )
192           This method sets the value reported for boolean attributes inside
193           HTML start tags.  By default, the name of the attribute is also
194           used as its value.  This affects the values reported for "tokens"
195           and "attr" argspecs.
196
197       $p->case_sensitive
198       $p->case_sensitive( $bool )
199           By default, tagnames and attribute names are down-cased.  Enabling
200           this attribute leaves them as found in the HTML source document.
201
202       $p->closing_plaintext
203       $p->closing_plaintext( $bool )
           By default, the "plaintext" element can never be closed.  Everything
           up to the end of the document is parsed in CDATA mode.  This
           historical behaviour is what at least MSIE does.  Enabling this
           attribute makes a closing "</plaintext>" tag effective, and parsing
           resumes after this tag is seen.  This emulates early gecko-based
           browsers.
210
211       $p->empty_element_tags
212       $p->empty_element_tags( $bool )
           By default, empty element tags are not recognized as such and the
           "/" before ">" is just treated like a normal name character (unless
           "strict_names" is enabled).  Enabling this attribute makes
           "HTML::Parser" recognize these tags.

           Empty element tags look like start tags, but end with the character
           sequence "/>" instead of ">".  When recognized by "HTML::Parser"
           they cause an artificial end event in addition to the start event.
           The "text" for the artificial end event will be empty and the
           "tokenpos" array will be undefined even though the "tokens" array
           will have one element containing the tag name.
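
           A small sketch of the artificial end event, assuming a single
           "<br/>" tag as input.  The @events array and the bracket
           notation around the event text are just for display:

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;
my $p = HTML::Parser->new(
    api_version        => 3,
    empty_element_tags => 1,
    start_h => [ sub { push @events, "start $_[0]" }, "tagname" ],
    # Show the event text in brackets; it is empty for the artificial end.
    end_h   => [ sub { push @events, "end $_[0] [$_[1]]" }, "tagname,text" ],
);

$p->parse("<br/>");
$p->eof;

print "$_\n" for @events;
```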
224
225       $p->marked_sections
226       $p->marked_sections( $bool )
227           By default, section markings like <![CDATA[...]]> are treated like
228           ordinary text.  When this attribute is enabled section markings are
229           honoured.
230
231           There are currently no events associated with the marked section
232           markup, but the text can be returned as "skipped_text".
233
234       $p->strict_comment
235       $p->strict_comment( $bool )
236           By default, comments are terminated by the first occurrence of
237           "-->".  This is the behaviour of most popular browsers (like
238           Mozilla, Opera and MSIE), but it is not correct according to the
239           official HTML standard.  Officially, you need an even number of
240           "--" tokens before the closing ">" is recognized and there may not
241           be anything but whitespace between an even and an odd "--".
242
243           The official behaviour is enabled by enabling this attribute.
244
245           Enabling of 'strict_comment' also disables recognizing these forms
246           as comments:
247
248             </ comment>
249             <! comment>
250
251       $p->strict_end
252       $p->strict_end( $bool )
253           By default, attributes and other junk are allowed to be present on
254           end tags in a manner that emulates MSIE's behaviour.
255
256           The official behaviour is enabled with this attribute.  If enabled,
257           only whitespace is allowed between the tagname and the final ">".
258
259       $p->strict_names
260       $p->strict_names( $bool )
261           By default, almost anything is allowed in tag and attribute names.
262           This is the behaviour of most popular browsers and allows us to
263           parse some broken tags with invalid attribute values like:
264
265              <IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>
266
267           By default, "LIST]" is parsed as a boolean attribute, not as part
268           of the ALT value as was clearly intended.  This is also what
269           Mozilla sees.
270
271           The official behaviour is enabled by enabling this attribute.  If
272           enabled, it will cause the tag above to be reported as text since
273           "LIST]" is not a legal attribute name.
274
275       $p->unbroken_text
276       $p->unbroken_text( $bool )
277           By default, blocks of text are given to the text handler as soon as
278           possible (but the parser takes care always to break text at a
279           boundary between whitespace and non-whitespace so single words and
280           entities can always be decoded safely).  This might create breaks
281           that make it hard to do transformations on the text. When this
282           attribute is enabled, blocks of text are always reported in one
283           piece.  This will delay the text event until the following (non-
284           text) event has been recognized by the parser.
285
286           Note that the "offset" argspec will give you the offset of the
287           first segment of text and "length" is the combined length of the
288           segments.  Since there might be ignored tags in between, these
289           numbers can't be used to directly index in the original document
290           file.
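
           The difference can be observed with a sketch like the one
           below.  collect_text() is a hypothetical helper, and the exact
           split points in the default mode are left unspecified here:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: parse the same chunked input with or without
# unbroken_text and return the pieces handed to the text handler.
sub collect_text {
    my ($unbroken) = @_;
    my @pieces;
    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => $unbroken,
        text_h        => [ sub { push @pieces, shift }, "text" ],
    );
    $p->parse($_) for "Hello wor", "ld and ", "more";
    $p->eof;
    return @pieces;
}

my @default  = collect_text(0);   # may arrive in several pieces
my @unbroken = collect_text(1);   # always one piece per text block

print scalar(@default), " piece(s) by default\n";
print scalar(@unbroken), " piece(s) with unbroken_text\n";
```

           In both modes the concatenation of the pieces is the same; only
           the number of text events differs.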
291
292       $p->utf8_mode
293       $p->utf8_mode( $bool )
294           Enable this option when parsing raw undecoded UTF-8.  This tells
295           the parser that the entities expanded for strings reported by
296           "attr", @attr and "dtext" should be expanded as decoded UTF-8 so
297           they end up compatible with the surrounding text.
298
299           If "utf8_mode" is enabled then it is an error to pass strings
300           containing characters with code above 255 to the parse() method,
301           and the parse() method will croak if you try.
302
           Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when
           UTF-8 encoded.  The character can also be represented by the entity
           "&hearts;" or "&#x2665;".  If we feed the parser:

             $p->parse("\xE2\x99\xA5&hearts;");

           then "dtext" will be reported as "\xE2\x99\xA5\x{2665}" without
           "utf8_mode" enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when
           enabled.  The latter string is what you want.
312
313           This option is only available with perl-5.8 or better.
314
315       $p->xml_mode
316       $p->xml_mode( $bool )
           Enabling this attribute changes the parser to allow some XML
           constructs.  This enables the behaviour controlled individually by
           the "case_sensitive", "empty_element_tags", "strict_names" and
           "xml_pic" attributes and also suppresses special treatment of
           elements that are parsed as CDATA for HTML.
322
323       $p->xml_pic
324       $p->xml_pic( $bool )
325           By default, processing instructions are terminated by ">". When
326           this attribute is enabled, processing instructions are terminated
327           by "?>" instead.
328
       As markup and text are recognized, handlers are invoked.  The following
       method is used to set up handlers for different events:
331
332       $p->handler( event => \&subroutine, $argspec )
333       $p->handler( event => $method_name, $argspec )
334       $p->handler( event => \@accum, $argspec )
335       $p->handler( event => "" );
336       $p->handler( event => undef );
337       $p->handler( event );
338           This method assigns a subroutine, method, or array to handle an
339           event.
340
341           Event is one of "text", "start", "end", "declaration", "comment",
342           "process", "start_document", "end_document" or "default".
343
344           The "\&subroutine" is a reference to a subroutine which is called
345           to handle the event.
346
347           The $method_name is the name of a method of $p which is called to
348           handle the event.
349
350           The @accum is an array that will hold the event information as sub-
351           arrays.
352
353           If the second argument is "", the event is ignored.  If it is
354           undef, the default handler is invoked for the event.
355
356           The $argspec is a string that describes the information to be
357           reported for the event.  Any requested information that does not
358           apply to a specific event is passed as "undef".  If argspec is
359           omitted, then it is left unchanged.
360
361           The return value from $p->handler is the old callback routine or a
362           reference to the accumulator array.
363
364           Any return values from handler callback routines/methods are always
365           ignored.  A handler callback can request parsing to be aborted by
366           invoking the $p->eof method.  A handler callback is not allowed to
367           invoke the $p->parse() or $p->parse_file() method.  An exception
368           will be raised if it tries.
369
370           Examples:
371
372               $p->handler(start =>  "start", 'self, attr, attrseq, text' );
373
374           This causes the "start" method of object $p to be called for
375           'start' events.  The callback signature is $p->start(\%attr,
376           \@attr_seq, $text).
377
378               $p->handler(start =>  \&start, 'attr, attrseq, text' );
379
380           This causes subroutine start() to be called for 'start' events.
381           The callback signature is start(\%attr, \@attr_seq, $text).
382
383               $p->handler(start =>  \@accum, '"S", attr, attrseq, text' );
384
385           This causes 'start' event information to be saved in @accum.  The
386           array elements will be ['S', \%attr, \@attr_seq, $text].
387
388              $p->handler(start => "");
389
390           This causes 'start' events to be ignored.  It also suppresses
391           invocations of any default handler for start events.  It is in most
392           cases equivalent to $p->handler(start => sub {}), but is more
393           efficient.  It is different from the empty-sub-handler in that
394           "skipped_text" is not reset by it.
395
396              $p->handler(start => undef);
397
398           This causes no handler to be associated with start events.  If
399           there is a default handler it will be invoked.
400
401       Filters based on tags can be set up to limit the number of events
402       reported.  The main bottleneck during parsing is often the huge number
403       of callbacks made from the parser.  Applying filters can improve
404       performance significantly.
405
406       The following methods control filters:
407
408       $p->ignore_elements( @tags )
           Both the "start" event and the "end" event as well as any events
           that would be reported in between are suppressed.  The ignored
           elements can contain nested occurrences of themselves.  Example:
412
413              $p->ignore_elements(qw(script style));
414
415           The "script" and "style" tags will always nest properly since their
416           content is parsed in CDATA mode.  For most other tags
417           "ignore_elements" must be used with caution since HTML is often not
418           well formed.
419
420       $p->ignore_tags( @tags )
421           Any "start" and "end" events involving any of the tags given are
422           suppressed.  To reset the filter (i.e. don't suppress any "start"
423           and "end" events), call "ignore_tags" without an argument.
424
425       $p->report_tags( @tags )
426           Any "start" and "end" events involving any of the tags not given
427           are suppressed.  To reset the filter (i.e. report all "start" and
428           "end" events), call "report_tags" without an argument.
429
430       Internally, the system has two filter lists, one for "report_tags" and
431       one for "ignore_tags", and both filters are applied.  This effectively
432       gives "ignore_tags" precedence over "report_tags".
433
434       Examples:
435
436          $p->ignore_tags(qw(style));
437          $p->report_tags(qw(script style));
438
439       results in only "script" events being reported.
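
       As a sketch of a typical use of these filters, the following
       extracts the visible text of a document while skipping "script" and
       "style" content.  The sample markup is invented for the example:

```perl
use strict;
use warnings;
use HTML::Parser ();

my $visible = "";
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $visible .= shift }, "dtext" ],
);

# Suppress the elements and everything reported between their tags.
$p->ignore_elements(qw(script style));

$p->parse('<p>Visible</p><script>var s = "<b>hidden</b>";</script>'
        . '<style>p { color: red }</style><p> text</p>');
$p->eof;

print "$visible\n";
```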
440
   Argspec
442       Argspec is a string containing a comma-separated list that describes
443       the information reported by the event.  The following argspec
444       identifier names can be used:
445
446       "attr"
447           Attr causes a reference to a hash of attribute name/value pairs to
448           be passed.
449
450           Boolean attributes' values are either the value set by
451           $p->boolean_attribute_value, or the attribute name if no value has
452           been set by $p->boolean_attribute_value.
453
454           This passes undef except for "start" events.
455
456           Unless "xml_mode" or "case_sensitive" is enabled, the attribute
457           names are forced to lower case.
458
459           General entities are decoded in the attribute values and one layer
460           of matching quotes enclosing the attribute values is removed.
461
462           The Unicode character set is assumed for entity decoding.  With
463           Perl version 5.6 or earlier only the Latin-1 range is supported,
464           and entities for characters outside the range 0..255 are left
465           unchanged.
466
467       @attr
468           Basically the same as "attr", but keys and values are passed as
469           individual arguments and the original sequence of the attributes is
470           kept.  The parameters passed will be the same as the @attr
471           calculated here:
472
473              @attr = map { $_ => $attr->{$_} } @$attrseq;
474
475           assuming $attr and $attrseq here are the hash and array passed as
476           the result of "attr" and "attrseq" argspecs.
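
           The equivalence can be checked with a sketch such as this one,
           which requests all three argspecs in a single handler.  The
           sample tag is an assumption made for the example:

```perl
use strict;
use warnings;
use HTML::Parser ();

my ($attr, $attrseq, @flat);
my $p = HTML::Parser->new(
    api_version => 3,
    # @attr is listed last because it expands to a variable-length list.
    start_h => [ sub { ($attr, $attrseq, @flat) = @_ },
                 'attr,attrseq,@attr' ],
);

$p->parse('<img SRC="a.png" alt="A" ismap>');
$p->eof;

# Rebuild the flat list from the hash and the name sequence.
my @rebuilt = map { $_ => $attr->{$_} } @$attrseq;

print "flat:    @flat\n";
print "rebuilt: @rebuilt\n";
```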
477
478           This passes no values for events besides "start".
479
480       "attrseq"
481           Attrseq causes a reference to an array of attribute names to be
482           passed.  This can be useful if you want to walk the "attr" hash in
483           the original sequence.
484
485           This passes undef except for "start" events.
486
487           Unless "xml_mode" or "case_sensitive" is enabled, the attribute
488           names are forced to lower case.
489
490       "column"
491           Column causes the column number of the start of the event to be
492           passed.  The first column on a line is 0.
493
494       "dtext"
495           Dtext causes the decoded text to be passed.  General entities are
496           automatically decoded unless the event was inside a CDATA section
497           or was between literal start and end tags ("script", "style",
498           "xmp", "iframe", "title", "textarea" and "plaintext").
499
500           The Unicode character set is assumed for entity decoding.  With
501           Perl version 5.6 or earlier only the Latin-1 range is supported,
502           and entities for characters outside the range 0..255 are left
503           unchanged.
504
505           This passes undef except for "text" events.
506
507       "event"
508           Event causes the event name to be passed.
509
510           The event name is one of "text", "start", "end", "declaration",
511           "comment", "process", "start_document" or "end_document".
512
513       "is_cdata"
514           Is_cdata causes a TRUE value to be passed if the event is inside a
515           CDATA section or between literal start and end tags ("script",
516           "style", "xmp", "iframe", "title", "textarea" and "plaintext").
517
           If the flag is FALSE for a text event, then you should normally
519           either use "dtext" or decode the entities yourself before the text
520           is processed further.
521
522       "length"
523           Length causes the number of bytes of the source text of the event
524           to be passed.
525
526       "line"
527           Line causes the line number of the start of the event to be passed.
528           The first line in the document is 1.  Line counting doesn't start
529           until at least one handler requests this value to be reported.
530
531       "offset"
532           Offset causes the byte position in the HTML document of the start
533           of the event to be passed.  The first byte in the document has
534           offset 0.
535
536       "offset_end"
537           Offset_end causes the byte position in the HTML document of the end
538           of the event to be passed.  This is the same as "offset" +
539           "length".
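
           The position argspecs ("line", "column", "offset", "length" and
           "offset_end") can be combined in one handler.  A small sketch,
           assuming a two-line sample input:

```perl
use strict;
use warnings;
use HTML::Parser ();

my @seen;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @seen, [@_] },
                 "tagname,line,column,offset,length,offset_end" ],
);

# <b> starts on the second line, after two spaces of indentation.
$p->parse("<p>\n  <b>x</b>");
$p->eof;

printf "%-3s line=%d column=%d bytes=%d..%d\n", @{$_}[0, 1, 2, 3, 5]
    for @seen;
```

           Lines count from 1 while columns and offsets count from 0, so
           the two kinds of values should not be mixed without adjusting.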
540
541       "self"
542           Self causes the current object to be passed to the handler.  If the
543           handler is a method, this must be the first element in the argspec.
544
545           An alternative to passing self as an argspec is to register
546           closures that capture $self by themselves as handlers.
547           Unfortunately this creates circular references which prevent the
548           HTML::Parser object from being garbage collected.  Using the "self"
549           argspec avoids this problem.
550
551       "skipped_text"
552           Skipped_text returns the concatenated text of all the events that
553           have been skipped since the last time an event was reported.
554           Events might be skipped because no handler is registered for them
555           or because some filter applies.  Skipped text also includes marked
556           section markup, since there are no events that can catch it.
557
558           If an ""-handler is registered for an event, then the text for this
559           event is not included in "skipped_text".  Skipped text both before
560           and after the ""-event is included in the next reported
561           "skipped_text".
562
563       "tag"
564           Same as "tagname", but prefixed with "/" if it belongs to an "end"
565           event and "!" for a declaration.  The "tag" does not have any
566           prefix for "start" events, and is in this case identical to
567           "tagname".
568
569       "tagname"
570           This is the element name (or generic identifier in SGML jargon) for
571           start and end tags.  Since HTML is case insensitive, this name is
572           forced to lower case to ease string matching.
573
574           Since XML is case sensitive, the tagname case is not changed when
575           "xml_mode" is enabled.  The same happens if the "case_sensitive"
576           attribute is set.
577
578           The declaration type of declaration elements is also passed as a
579           tagname, even if that is a bit strange.  In fact, in the current
580           implementation tagname is identical to "token0" except that the
581           name may be forced to lower case.
582
583       "token0"
584           Token0 causes the original text of the first token string to be
585           passed.  This should always be the same as $tokens->[0].
586
587           For "declaration" events, this is the declaration type.
588
589           For "start" and "end" events, this is the tag name.
590
591           For "process" and non-strict "comment" events, this is everything
592           inside the tag.
593
594           This passes undef if there are no tokens in the event.
595
596       "tokenpos"
597           Tokenpos causes a reference to an array of token positions to be
598           passed.  For each string that appears in "tokens", this array
599           contains two numbers.  The first number is the offset of the start
600           of the token in the original "text" and the second number is the
601           length of the token.
602
603           Boolean attributes in a "start" event will have (0,0) for the
604           attribute value offset and length.
605
606           This passes undef if there are no tokens in the event (e.g.,
607           "text") and for artificial "end" events triggered by empty element
608           tags.
609
610           If you are using these offsets and lengths to modify "text", you
611           should either work from right to left, or be very careful to
612           calculate the changes to the offsets.
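
           A sketch of the right-to-left approach: the hypothetical
           mask_values() helper below replaces every attribute value in a
           start tag with "*", relying on "tokenpos" holding one
           offset/length pair for the tag name followed by pairs for each
           attribute name and value:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: overwrite attribute values in the tag text.
sub mask_values {
    my ($text, $tokenpos) = @_;
    my @pos = @$tokenpos;
    # Value offsets sit at indices 4, 8, 12, ...; start from the last
    # one so earlier offsets stay valid while the text changes length.
    for (my $i = $#pos - 1; $i >= 4; $i -= 4) {
        my ($off, $len) = @pos[$i, $i + 1];
        next unless $len;            # boolean attribute: (0,0)
        substr($text, $off, $len) = '"*"';
    }
    return $text;
}

my @out;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { push @out, mask_values(@_) }, "text,tokenpos" ],
);

$p->parse(q{<a href="x.html" title='Long title' disabled>});
$p->eof;

print "$_\n" for @out;
```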
613
614       "tokens"
615           Tokens causes a reference to an array of token strings to be
616           passed.  The strings are exactly as they were found in the original
617           text, no decoding or case changes are applied.
618
619           For "declaration" events, the array contains each word, comment,
620           and delimited string starting with the declaration type.
621
           For "comment" events, this contains each sub-comment.  If
           $p->strict_comment is disabled, there will be only one
           sub-comment.
625
626           For "start" events, this contains the original tag name followed by
627           the attribute name/value pairs.  The values of boolean attributes
628           will be either the value set by $p->boolean_attribute_value, or the
629           attribute name if no value has been set by
630           $p->boolean_attribute_value.
631
632           For "end" events, this contains the original tag name (always one
633           token).
634
635           For "process" events, this contains the process instructions
636           (always one token).
637
638           This passes "undef" for "text" events.
639
640       "text"
641           Text causes the source text (including markup element delimiters)
642           to be passed.
643
644       "undef"
645           Pass an undefined value.  Useful as padding where the same handler
646           routine is registered for multiple events.
647
648       '...'
649           A literal string of 0 to 255 characters enclosed in single (') or
650           double (") quotes is passed as entered.
651
652       The whole argspec string can be wrapped up in '@{...}' to signal that
653       the resulting event array should be flattened.  This only makes a
654       difference if an array reference is used as the handler target.
655       Consider this example:
656
          $p->handler(text => [], 'text');
          $p->handler(text => [], '@{text}');
659
660       With two text events; "foo", "bar"; then the first example will end up
661       with [["foo"], ["bar"]] and the second with ["foo", "bar"] in the
662       handler target array.
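
       The two shapes can be compared with a sketch like this one, which
       runs the same input through two parsers.  The input string and the
       array names are assumptions for illustration:

```perl
use strict;
use warnings;
use HTML::Parser ();

my (@nested, @flat);

# Plain argspec: one sub-array per event.
my $p1 = HTML::Parser->new(api_version => 3);
$p1->handler(text => \@nested, 'text');
$p1->parse("foo<br>bar");
$p1->eof;

# Wrapped in @{...}: values are pushed directly onto the array.
my $p2 = HTML::Parser->new(api_version => 3);
$p2->handler(text => \@flat, '@{text}');
$p2->parse("foo<br>bar");
$p2->eof;

print "nested: ", join(" ", map { "[@$_]" } @nested), "\n";
print "flat:   @flat\n";
```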
663
   Events
665       Handlers for the following events can be registered:
666
667       "comment"
668           This event is triggered when a markup comment is recognized.
669
670           Example:
671
672             <!-- This is a comment -- -- So is this -->
673
674       "declaration"
675           This event is triggered when a markup declaration is recognized.
676
677           For typical HTML documents, the only declaration you are likely to
678           find is <!DOCTYPE ...>.
679
680           Example:
681
             <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
             "http://www.w3.org/TR/html4/strict.dtd">
684
685           DTDs inside <!DOCTYPE ...> will confuse HTML::Parser.
686
687       "default"
688           This event is triggered for events that do not have a specific
689           handler.  You can set up a handler for this event to catch stuff
690           you did not want to catch explicitly.
691
692       "end"
693           This event is triggered when an end tag is recognized.
694
695           Example:
696
697             </A>
698
699       "end_document"
700           This event is triggered when $p->eof is called and after any
701           remaining text is flushed.  There is no document text associated
702           with this event.
703
704       "process"
           This event is triggered when processing instruction markup is
           recognized.
707
708           The format and content of processing instructions are system and
709           application dependent.
710
711           Examples:
712
713             <? HTML processing instructions >
714             <? XML processing instructions ?>
715
716       "start"
717           This event is triggered when a start tag is recognized.
718
719           Example:
720
721             <A HREF="http://www.perl.com/">
722
723       "start_document"
724           This event is triggered before any other events for a new document.
725           A handler for it can be used to initialize stuff.  There is no
726           document text associated with this event.
727
728       "text"
729           This event is triggered when plain text (characters) is recognized.
730           The text may contain multiple lines.  A sequence of text may be
731           broken between several text events unless $p->unbroken_text is
732           enabled.
733
734           The parser will make sure that it does not break a word or a
735           sequence of whitespace between two text events.
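
       The events listed above can be observed with a small tracer.  The
       sketch below assumes that a single "default" handler is enough to
       see every event, and it enables $p->unbroken_text so that text
       split across two chunks should arrive as one "text" event.  The
       sample markup and the @trace variable are made up for illustration:

          use strict;
          use warnings;
          use HTML::Parser ();

          my @trace;   # one "event:text" string per reported event

          my $p = HTML::Parser->new(
              api_version => 3,
              default_h   => [sub {
                                  # some events carry no text at all
                                  push @trace, join(":",
                                      map { defined $_ ? $_ : "" } @_);
                              },
                              "event, text"],
          );
          $p->unbroken_text(1);

          # The text is deliberately split in the middle of a word
          $p->parse("<!DOCTYPE html><!-- note --><p>Hello wor");
          $p->parse("ld</p>");
          $p->eof;

          print "$_\n" for @trace;

       Registering a specific handler for one event, for example "text",
       removes that event from what the default handler sees.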
736
737   Unicode
738       The "HTML::Parser" can parse Unicode strings when running under
739       perl-5.8 or better.  If Unicode is passed to $p->parse() then chunks of
740       Unicode will be reported to the handlers.  The offset and length
741       argspecs will also report their position in terms of characters.
742
743       It is safe to parse raw undecoded UTF-8 if you either avoid decoding
744       entities and make sure to not use argspecs that do, or enable the
745       "utf8_mode" for the parser.  Parsing of undecoded UTF-8 might be useful
746       when parsing from a file where you need the reported offsets and
747       lengths to match the byte offsets in the file.
748
749       If a filename is passed to $p->parse_file() then the file will be read
750       in binary mode.  This will be fine if the file contains only ASCII or
751       Latin-1 characters.  If the file contains UTF-8 encoded text then care
752       must be taken when decoding entities as described in the previous
753       paragraph, but better is to open the file with the UTF-8 layer so that
754       it is decoded properly:
755
756          open(my $fh, "<:utf8", "index.html") || die "...: $!";
757          $p->parse_file($fh);
758
759       If the file contains text encoded in a charset besides ASCII, Latin-1
760       or UTF-8 then decoding will always be needed.
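
       For data that is already in memory, one possible approach is to
       decode it with the Encode module before it reaches the parser.  The
       sketch below assumes ISO-8859-2 input; the sample bytes and the
       variable names are made up for illustration:

          use strict;
          use warnings;
          use Encode ();
          use HTML::Parser ();

          # Hypothetical input: byte 0xB3 is U+0142 in ISO-8859-2
          my $bytes = "<p>z\xB3oty &amp; grosz</p>";
          my $chars = Encode::decode("iso-8859-2", $bytes);

          my $text = "";
          my $p = HTML::Parser->new(
              api_version => 3,
              text_h      => [sub { $text .= shift }, "dtext"],
          );
          $p->parse($chars);    # the parser only ever sees characters
          $p->eof;

          binmode(STDOUT, ":encoding(UTF-8)");
          print "$text\n";

       Because the parser is handed a decoded string, the entity decoding
       done for "dtext" and the surrounding text agree on what a character
       is.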
761

VERSION 2 COMPATIBILITY

763       When an "HTML::Parser" object is constructed with no arguments, a set
764       of handlers is automatically provided that is compatible with the old
765       HTML::Parser version 2 callback methods.
766
767       This is equivalent to the following method calls:
768
769          $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
770          $p->handler(end     => "end",     "self, tagname, text");
771          $p->handler(text    => "text",    "self, text, is_cdata");
772          $p->handler(process => "process", "self, token0, text");
773          $p->handler(comment =>
774                    sub {
775                        my($self, $tokens) = @_;
776                        for (@$tokens) {$self->comment($_);}},
777                    "self, tokens");
778          $p->handler(declaration =>
779                    sub {
780                        my $self = shift;
781                        $self->declaration(substr($_[0], 2, -1));},
782                    "self, text");
783
784       Setting up these handlers can also be requested with the "api_version
785       => 2" constructor option.
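
       A version 2 style subclass might look roughly like the sketch
       below.  The package name "V2Style" and the "v2style_log" hash key
       are made up for illustration, and the empty stubs are there on the
       assumption that every method named by the handlers above should
       exist in the subclass:

          package V2Style;             # hypothetical subclass name
          use strict;
          use warnings;
          use base 'HTML::Parser';

          # Method names and argument order follow the handler list above
          sub start {
              my($self, $tagname, $attr, $attrseq, $text) = @_;
              push @{ $self->{v2style_log} }, "start $tagname";
          }
          sub end {
              my($self, $tagname, $text) = @_;
              push @{ $self->{v2style_log} }, "end $tagname";
          }
          sub text {
              my($self, $text, $is_cdata) = @_;
              push @{ $self->{v2style_log} }, "text $text";
          }
          # Empty stubs for the events this sketch does not care about
          sub process     { }
          sub comment     { }
          sub declaration { }

          package main;

          my $p = V2Style->new;        # no arguments: version 2 handlers
          $p->parse("<p>Hi</p>");
          $p->eof;
          print "$_\n" for @{ $p->{v2style_log} };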
786

SUBCLASSING

788       The "HTML::Parser" class is subclassable.  Parser objects are plain
789       hashes and "HTML::Parser" reserves only hash keys that start with
790       "_hparser".  The parser state can be set up by invoking the init()
791       method, which takes the same arguments as new().
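
       A minimal sketch of a subclass that keeps its own state in the
       object hash and sets up the parser through init().  The package
       name "LinkCounter", the method names and the "linkcounter_seen" key
       are made up for illustration:

          package LinkCounter;         # hypothetical subclass name
          use strict;
          use warnings;
          use base 'HTML::Parser';

          sub new {
              my($class, %opt) = @_;
              # Own state lives under a key outside the "_hparser" prefix
              my $self = bless { linkcounter_seen => 0 }, $class;
              $self->init(api_version => 3,
                          start_h     => ["count_start", "self, tagname"],
                          %opt);
              return $self;
          }

          # Registered above by method name, so it is called on the object
          sub count_start {
              my($self, $tagname) = @_;
              $self->{linkcounter_seen}++ if $tagname eq "a";
          }

          sub link_count { $_[0]{linkcounter_seen} }

          package main;

          my $p = LinkCounter->new;
          $p->parse('<a href="x">x</a> <a href="y">y</a> <b>z</b>');
          $p->eof;
          print $p->link_count, "\n";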
792

EXAMPLES

794       The first simple example shows how you might strip out comments from an
795       HTML document.  We achieve this by setting up a comment handler that
796       does nothing and a default handler that will print out anything else:
797
798         use HTML::Parser;
799         HTML::Parser->new(default_h => [sub { print shift }, 'text'],
800                           comment_h => [""],
801                          )->parse_file(shift || die) || die $!;
802
803       An alternative implementation is:
804
805         use HTML::Parser;
806         HTML::Parser->new(end_document_h => [sub { print shift },
807                                              'skipped_text'],
808                           comment_h      => [""],
809                          )->parse_file(shift || die) || die $!;
810
811       This will in most cases be much more efficient since only a single
812       callback will be made.
813
814       The next example prints out the text that is inside the <title> element
815       of an HTML document.  Here we start by setting up a start handler.
816       When it sees the title start tag it enables a text handler that prints
817       any text found and an end handler that will terminate parsing as soon
818       as the title end tag is seen:
819
820         use HTML::Parser ();
821
822         sub start_handler
823         {
824           return if shift ne "title";
825           my $self = shift;
826           $self->handler(text => sub { print shift }, "dtext");
827           $self->handler(end  => sub { shift->eof if shift eq "title"; },
828                                  "tagname,self");
829         }
830
831         my $p = HTML::Parser->new(api_version => 3);
832         $p->handler( start => \&start_handler, "tagname,self");
833         $p->parse_file(shift || die) || die $!;
834         print "\n";
835
       More examples are found in the eg/ directory of the "HTML-Parser"
       distribution: the program "hrefsub" shows how you can edit all links
       found in a document; the program "htextsub" shows how to edit the text
       only; the program "hstrip" shows how you can strip out certain
       tags/elements and/or attributes; and the program "htext" shows how to
       obtain the plain text, but not any script/style content.
842
843       You can browse the eg/ directory online from the [Browse] link on the
844       http://search.cpan.org/~gaas/HTML-Parser/ page.
845

BUGS

847       The <style> and <script> sections do not end with the first "</", but
848       need the complete corresponding end tag.  The standard behaviour is not
849       really practical.
850
851       When the strict_comment option is enabled, we still recognize comments
852       where there is something other than whitespace between even and odd
853       "--" markers.
854
855       Once $p->boolean_attribute_value has been set, there is no way to
856       restore the default behaviour.
857
858       There is currently no way to get both quote characters into the same
859       literal argspec.
860
861       Empty tags, e.g. "<>" and "</>", are not recognized.  SGML allows them
862       to repeat the previous start tag or close the previous start tag
863       respectively.
864
       NET tags, e.g. "<code/.../", are not recognized.  This is SGML
       shorthand for "<code>...</code>".
867
868       Unclosed start or end tags, e.g. "<tt<b>...</b</tt>" are not
869       recognized.
870

DIAGNOSTICS

872       The following messages may be produced by HTML::Parser.  The notation
873       in this listing is the same as used in perldiag:
874
875       Not a reference to a hash
876           (F) The object blessed into or subclassed from HTML::Parser is not
877           a hash as required by the HTML::Parser methods.
878
879       Bad signature in parser state object at %p
880           (F) The _hparser_xs_state element does not refer to a valid state
881           structure.  Something must have changed the internal value stored
882           in this hash element, or the memory has been overwritten.
883
884       _hparser_xs_state element is not a reference
885           (F) The _hparser_xs_state element has been destroyed.
886
887       Can't find '_hparser_xs_state' element in HTML::Parser hash
888           (F) The _hparser_xs_state element is missing from the parser hash.
889           It was either deleted, or not created when the object was created.
890
891       API version %s not supported by HTML::Parser %s
892           (F) The constructor option 'api_version' with an argument greater
893           than or equal to 4 is reserved for future extensions.
894
895       Bad constructor option '%s'
896           (F) An unknown constructor option key was passed to the new() or
897           init() methods.
898
899       Parse loop not allowed
900           (F) A handler invoked the parse() or parse_file() method.  This is
901           not permitted.
902
903       marked sections not supported
           (F) The $p->marked_sections() method was invoked in an HTML::Parser
           module that was compiled without support for marked sections.
906
907       Unknown boolean attribute (%d)
908           (F) Something is wrong with the internal logic that set up aliases
909           for boolean attributes.
910
911       Only code or array references allowed as handler
           (F) The second argument for $p->handler must be either a subroutine
           reference, the name of a subroutine or method, or a reference to
           an array.
915
916       No handler for %s events
917           (F) The first argument to $p->handler must be a valid event name;
918           i.e. one of "start", "end", "text", "process", "declaration" or
919           "comment".
920
921       Unrecognized identifier %s in argspec
922           (F) The identifier is not a known argspec name.  Use one of the
923           names mentioned in the argspec section above.
924
925       Literal string is longer than 255 chars in argspec
926           (F) The current implementation limits the length of literals in an
927           argspec to 255 characters.  Make the literal shorter.
928
929       Backslash reserved for literal string in argspec
930           (F) The backslash character "\" is not allowed in argspec literals.
931           It is reserved to permit quoting inside a literal in a later
932           version.
933
934       Unterminated literal string in argspec
935           (F) The terminating quote character for a literal was not found.
936
937       Bad argspec (%s)
938           (F) Only identifier names, literals, spaces and commas are allowed
939           in argspecs.
940
941       Missing comma separator in argspec
942           (F) Identifiers in an argspec must be separated with ",".
943
944       Parsing of undecoded UTF-8 will give garbage when decoding entities
945           (W) The first chunk parsed appears to contain undecoded UTF-8 and
946           one or more argspecs that decode entities are used for the callback
947           handlers.
948
949           The result of decoding will be a mix of encoded and decoded
950           characters for any entities that expand to characters with code
951           above 127.  This is not a good thing.
952
           The solution is to use Encode::decode_utf8() on the data before
           feeding it to $p->parse().  For $p->parse_file() pass a file
           that has been opened in ":utf8" mode.
956
           The parser can process raw undecoded UTF-8 sanely if
           "utf8_mode" is enabled or if the "attr", "@attr" and "dtext"
           argspecs are avoided.
960
961       Parsing string decoded with wrong endianness
962           (W) The first character in the document is U+FFFE.  This is not a
963           legal Unicode character but a byte swapped BOM.  The result of
964           parsing will likely be garbage.
965
966       Parsing of undecoded UTF-32
967           (W) The parser found the Unicode UTF-32 BOM signature at the start
968           of the document.  The result of parsing will likely be garbage.
969
970       Parsing of undecoded UTF-16
971           (W) The parser found the Unicode UTF-16 BOM signature at the start
972           of the document.  The result of parsing will likely be garbage.
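
       Since the (F) messages above are fatal, a program that builds
       argspecs or handler registrations at run time may want to trap
       them.  A minimal sketch, using a deliberately invalid argspec
       identifier ("no_such_arg" is made up for illustration):

          use strict;
          use warnings;
          use HTML::Parser ();

          my $p = HTML::Parser->new(api_version => 3);

          # eval traps the fatal diagnostic instead of ending the program
          my $ok = eval {
              $p->handler(text => sub { }, "no_such_arg");
              1;
          };
          print "handler() failed: $@" unless $ok;

       The (W) messages are warnings and do not stop the parse.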
973

SEE ALSO

975       HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser,
976       HTML::LinkExtor, HTML::Form
977
978       HTML::TreeBuilder (part of the HTML-Tree distribution)
979
980       http://www.w3.org/TR/html4
981
982       More information about marked sections and processing instructions may
983       be found at "http://www.sgml.u-net.com/book/sgml-8.htm".
984

COPYRIGHT

986        Copyright 1996-2008 Gisle Aas. All rights reserved.
987        Copyright 1999-2000 Michael A. Chase.  All rights reserved.
988
989       This library is free software; you can redistribute it and/or modify it
990       under the same terms as Perl itself.
991
992
993
994perl v5.10.1                      2009-10-25                         Parser(3)