HTML::Parser(3pm)

Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME

       HTML::Parser - HTML parser class

SYNOPSIS

        use HTML::Parser ();

        # Create parser object
        $p = HTML::Parser->new( api_version => 3,
                                start_h => [\&start, "tagname, attr"],
                                end_h   => [\&end,   "tagname"],
                                marked_sections => 1,
                              );

        # Parse document text chunk by chunk
        $p->parse($chunk1);
        $p->parse($chunk2);
        #...
        $p->eof;                 # signal end of document

        # Parse directly from file
        $p->parse_file("foo.html");
        # or
        open(my $fh, "<:utf8", "foo.html") || die;
        $p->parse_file($fh);

DESCRIPTION

       Objects of the "HTML::Parser" class will recognize markup and separate
       it from plain text (alias data content) in HTML documents.  As
       different kinds of markup and text are recognized, the corresponding
       event handlers are invoked.

       "HTML::Parser" is not a generic SGML parser.  We have tried to make it
       able to deal with the HTML that is actually "out there", and it
       normally parses as closely as possible to the way the popular web
       browsers do it instead of strictly following one of the many HTML
       specifications from W3C.  Where there is disagreement, there is often
       an option that you can enable to get the official behaviour.

       The document to be parsed may be supplied in arbitrary chunks.  This
       makes on-the-fly parsing as documents are received from the network
       possible.

       If event driven parsing does not feel right for your application, you
       might want to use "HTML::PullParser".  This is an "HTML::Parser"
       subclass that allows a more conventional program structure.

METHODS

       The following method is used to construct a new "HTML::Parser" object:

       $p = HTML::Parser->new( %options_and_handlers )
           This class method creates a new "HTML::Parser" object and returns
           it.  Key/value argument pairs may be provided to assign event
           handlers or initialize parser options.  The handlers and parser
           options can also be set or modified later by the method calls
           described below.

           If a top level key is in the form "<event>_h" (e.g., "text_h") then
           it assigns a handler to that event, otherwise it initializes a
           parser option. The event handler specification value must be an
           array reference.  Multiple handlers may also be assigned with the
           'handlers => [%handlers]' option.  See examples below.

           If new() is called without any arguments, it will create a parser
           that uses callback methods compatible with version 2 of
           "HTML::Parser".  See the section on "version 2 compatibility" below
           for details.

           The special constructor option 'api_version => 2' can be used to
           initialize version 2 callbacks while still setting other options
           and handlers.  The 'api_version => 3' option can be used if you
           don't want to set any options and don't want to fall back to v2
           compatible mode.

           Examples:

            $p = HTML::Parser->new(api_version => 3,
                                   text_h => [ sub {...}, "dtext" ]);

           This creates a new parser object with a text event handler
           subroutine that receives the original text with general entities
           decoded.

            $p = HTML::Parser->new(api_version => 3,
                                   start_h => [ 'my_start', "self,tokens" ]);

           This creates a new parser object with a start event handler method
           that receives the $p and the tokens array.

            $p = HTML::Parser->new(api_version => 3,
                                   handlers => { text => [\@array, "event,text"],
                                                 comment => [\@array, "event,text"],
                                               });

           This creates a new parser object that stores the event type and the
           original text in @array for text and comment events.
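
           As a minimal, runnable sketch of the accumulator form of the
           constructor: the input string and the variable name @events are
           illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Accumulate text and comment events in one array, in document order.
my @events;
my $p = HTML::Parser->new(
    api_version => 3,
    handlers    => {
        text    => [\@events, "event,text"],
        comment => [\@events, "event,text"],
    },
);

# Illustrative input (an assumption for this sketch).
$p->parse("Hello <!-- note -->world");
$p->eof;

# Each element of @events is an array reference: [$event, $text].
for my $e (@events) {
    printf "%-7s %s\n", @$e;
}
```

           Because both events share one array, the relative order of text
           and comments in the document is preserved.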

       The following methods feed the HTML document to the "HTML::Parser"
       object:

       $p->parse( $string )
           Parse $string as the next chunk of the HTML document.  Handlers
           invoked should not attempt to modify the $string in-place until
           $p->parse returns.

           If an invoked event handler aborts parsing by calling $p->eof, then
           $p->parse() will return a FALSE value.  Otherwise the return value
           is a reference to the parser object ($p).

       $p->parse( $code_ref )
           If a code reference is passed as the argument to be parsed, then
           the chunks to be parsed are obtained by invoking this function
           repeatedly.  Parsing continues until the function returns an empty
           (or undefined) result.  When this happens $p->eof is automatically
           signaled.

           Parsing will also abort if one of the event handlers calls $p->eof.

           The effect of this is the same as:

            while (1) {
               my $chunk = &$code_ref();
               if (!defined($chunk) || !length($chunk)) {
                   $p->eof;
                   return $p;
               }
               $p->parse($chunk) || return undef;
            }

           But it is more efficient as this loop runs internally in XS code.
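
           A small sketch of the code-reference form, assuming the chunks
           come from an in-memory list (@chunks and @tags are hypothetical
           names); in practice the callback would more likely read from a
           socket or file:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Markup deliberately split in the middle of a word and of a tag.
my @chunks = ("<p>Hel", "lo</p", ">");

my @tags;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [sub { push @tags, $_[0] },    "tagname"],
    end_h   => [sub { push @tags, "/$_[0]" }, "tagname"],
);

# shift returns undef once @chunks is empty, which ends the parse
# and signals eof automatically.
my $ret = $p->parse(sub { shift @chunks });

print join(" ", @tags), "\n";
```

           No explicit $p->eof call is needed here, since the undefined
           return value from the callback signals it.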

       $p->parse_file( $file )
           Parse text directly from a file.  The $file argument can be a
           filename, an open file handle, or a reference to an open file
           handle.

           If $file contains a filename and the file can't be opened, then the
           method returns an undefined value and $! tells why it failed.
           Otherwise the return value is a reference to the parser object.

           If a file handle is passed as the $file argument, then the file
           will normally be read until EOF, but not closed.

           If an invoked event handler aborts parsing by calling $p->eof, then
           $p->parse_file() may not have read the entire file.

           On systems with multi-byte line terminators, the values passed for
           the offset and length argspecs may be too low if parse_file() is
           called on a file handle that is not in binary mode.

           If a filename is passed in, then parse_file() will open the file in
           binary mode.

       $p->eof
           Signals the end of the HTML document.  Calling the $p->eof method
           outside a handler callback will flush any remaining buffered text
           (which triggers the "text" event if there is any remaining text).

           Calling $p->eof inside a handler will terminate parsing at that
           point and cause $p->parse to return a FALSE value.  This also
           terminates parsing by $p->parse_file().

           After $p->eof has been called, the parse() and parse_file() methods
           can be invoked to feed new documents to the parser object.

           The return value from eof() is a reference to the parser object.
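
           The following sketch shows a handler aborting the parse with
           $p->eof.  The input document and the choice to stop at "</head>"
           are assumptions made for illustration:

```perl
use strict;
use warnings;
use HTML::Parser ();

my @seen;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [sub { push @seen, $_[0] }, "tagname"],
    # Stop as soon as </head> is seen; "self" passes the parser object.
    end_h   => [sub {
        my ($self, $tagname) = @_;
        $self->eof if $tagname eq "head";
    }, "self,tagname"],
);

# Illustrative input; the <body> part is never reported.
my $ret = $p->parse(
    "<html><head><title>T</title></head><body><p>x</p></body></html>");

print "aborted early\n" unless $ret;
print "start tags seen: @seen\n";
```

           Checking the return value of parse() is how the caller can tell
           that a handler ended the parse before the input was exhausted.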

       Most parser options are controlled by boolean attributes.  Each boolean
       attribute is enabled by calling the corresponding method with a TRUE
       argument and disabled with a FALSE argument.  The attribute value is
       left unchanged if no argument is given.  The return value from each
       method is the old attribute value.

       Methods that can be used to get and/or set parser options are:

       $p->attr_encoded
       $p->attr_encoded( $bool )
           By default, the "attr" and @attr argspecs will have general
           entities for attribute values decoded.  Enabling this attribute
           leaves entities alone.

       $p->backquote
       $p->backquote( $bool )
           By default, only ' and " are recognized as quote characters around
           attribute values.  MSIE also recognizes backquotes for some reason.
           Enabling this attribute provides compatibility with this behaviour.

       $p->boolean_attribute_value( $val )
           This method sets the value reported for boolean attributes inside
           HTML start tags.  By default, the name of the attribute is also
           used as its value.  This affects the values reported for "tokens"
           and "attr" argspecs.

       $p->case_sensitive
       $p->case_sensitive( $bool )
           By default, tagnames and attribute names are down-cased.  Enabling
           this attribute leaves them as found in the HTML source document.

       $p->closing_plaintext
       $p->closing_plaintext( $bool )
           By default, the "plaintext" element can never be closed.
           Everything up to the end of the document is parsed in CDATA mode.
           This historical behaviour is what at least MSIE does.  Enabling
           this attribute makes the closing "</plaintext>" tag effective and
           the parsing process will resume after seeing this tag.  This
           emulates early gecko-based browsers.

       $p->empty_element_tags
       $p->empty_element_tags( $bool )
           By default, empty element tags are not recognized as such and the
           "/" before ">" is just treated like a normal name character (unless
           "strict_names" is enabled).  Enabling this attribute makes
           "HTML::Parser" recognize these tags.

           Empty element tags look like start tags, but end with the character
           sequence "/>" instead of ">".  When recognized by "HTML::Parser"
           they cause an artificial end event in addition to the start event.
           The "text" for the artificial end event will be empty and the
           "tokenpos" array will be undefined even though the token array will
           have one element containing the tag name.
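
           A short sketch of the artificial end event, assuming a lone
           "<br/>" as input (the @ev array is a hypothetical name):

```perl
use strict;
use warnings;
use HTML::Parser ();

my @ev;
my $p = HTML::Parser->new(
    api_version        => 3,
    empty_element_tags => 1,
    start_h => [sub { push @ev, "start:$_[0]" }, "tagname"],
    end_h   => [sub { push @ev, "end:$_[0]"   }, "tagname"],
);

# An XML-style empty element; with the option enabled it yields a
# start event followed by an artificial end event.
$p->parse("<br/>");
$p->eof;

print "$_\n" for @ev;
```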

       $p->marked_sections
       $p->marked_sections( $bool )
           By default, section markings like <![CDATA[...]]> are treated like
           ordinary text.  When this attribute is enabled section markings are
           honoured.

           There are currently no events associated with the marked section
           markup, but the text can be returned as "skipped_text".

       $p->strict_comment
       $p->strict_comment( $bool )
           By default, comments are terminated by the first occurrence of
           "-->".  This is the behaviour of most popular browsers (like
           Mozilla, Opera and MSIE), but it is not correct according to the
           official HTML standard.  Officially, you need an even number of
           "--" tokens before the closing ">" is recognized and there may not
           be anything but whitespace between an even and an odd "--".

           The official behaviour is enabled by enabling this attribute.

           Enabling of 'strict_comment' also disables recognizing these forms
           as comments:

             </ comment>
             <! comment>

       $p->strict_end
       $p->strict_end( $bool )
           By default, attributes and other junk are allowed to be present on
           end tags in a manner that emulates MSIE's behaviour.

           The official behaviour is enabled with this attribute.  If enabled,
           only whitespace is allowed between the tagname and the final ">".

       $p->strict_names
       $p->strict_names( $bool )
           By default, almost anything is allowed in tag and attribute names.
           This is the behaviour of most popular browsers and allows us to
           parse some broken tags with invalid attribute values like:

              <IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>

           By default, "LIST]" is parsed as a boolean attribute, not as part
           of the ALT value as was clearly intended.  This is also what
           Mozilla sees.

           The official behaviour is enabled by enabling this attribute.  If
           enabled, it will cause the tag above to be reported as text since
           "LIST]" is not a legal attribute name.

       $p->unbroken_text
       $p->unbroken_text( $bool )
           By default, blocks of text are given to the text handler as soon as
           possible (but the parser takes care always to break text at a
           boundary between whitespace and non-whitespace so single words and
           entities can always be decoded safely).  This might create breaks
           that make it hard to do transformations on the text. When this
           attribute is enabled, blocks of text are always reported in one
           piece.  This will delay the text event until the following
           (non-text) event has been recognized by the parser.

           Note that the "offset" argspec will give you the offset of the
           first segment of text and "length" is the combined length of the
           segments.  Since there might be ignored tags in between, these
           numbers can't be used to directly index in the original document
           file.
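
           As a sketch of the effect, assume a run of text that arrives
           split across two parse() calls (the chunk boundaries and the
           @texts array are illustrative assumptions):

```perl
use strict;
use warnings;
use HTML::Parser ();

my @texts;
my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,
    text_h        => [sub { push @texts, $_[0] }, "text"],
);

# The first run of text is split across two chunks, mid-word.
$p->parse("Hello wor");
$p->parse("ld <b>x</b>");
$p->eof;

# With unbroken_text enabled, each run of text between markup
# arrives as a single event.
print "[$_]\n" for @texts;
```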

       $p->utf8_mode
       $p->utf8_mode( $bool )
           Enable this option when parsing raw undecoded UTF-8.  This tells
           the parser that the entities expanded for strings reported by
           "attr", @attr and "dtext" should be expanded as decoded UTF-8 so
           they end up compatible with the surrounding text.

           If "utf8_mode" is enabled then it is an error to pass strings
           containing characters with code above 255 to the parse() method,
           and the parse() method will croak if you try.

           Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when
           UTF-8 encoded.  The character can also be represented by the entity
           "&hearts;" or "&#x2665;".  If we feed the parser:

             $p->parse("\xE2\x99\xA5&hearts;");

           then "dtext" will be reported as "\xE2\x99\xA5\x{2665}" without
           "utf8_mode" enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when
           enabled.  The latter string is what you want.

           This option is only available with perl-5.8 or better.
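
           The example above can be turned into a runnable sketch as
           follows; the accumulator variable $out is a hypothetical name,
           and the module is assumed to be built with Unicode support:

```perl
use strict;
use warnings;
use HTML::Parser ();

my $out = "";
my $p = HTML::Parser->new(
    api_version => 3,
    utf8_mode   => 1,
    text_h      => [sub { $out .= $_[0] }, "dtext"],
);

# Raw UTF-8 bytes for U+2665 followed by the equivalent entity.
$p->parse("\xE2\x99\xA5&hearts;");
$p->eof;

# Show the resulting bytes in hex so they can be compared by eye.
print join(" ", map { sprintf "%02X", ord } split //, $out), "\n";
```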

       $p->xml_mode
       $p->xml_mode( $bool )
           Enabling this attribute changes the parser to allow some XML
           constructs.  This enables the behaviour controlled individually by
           the "case_sensitive", "empty_element_tags", "strict_names" and
           "xml_pic" attributes and also suppresses special treatment of
           elements that are parsed as CDATA for HTML.

       $p->xml_pic
       $p->xml_pic( $bool )
           By default, processing instructions are terminated by ">". When
           this attribute is enabled, processing instructions are terminated
           by "?>" instead.

       As markup and text are recognized, handlers are invoked.  The following
       method is used to set up handlers for different events:

       $p->handler( event => \&subroutine, $argspec )
       $p->handler( event => $method_name, $argspec )
       $p->handler( event => \@accum, $argspec )
       $p->handler( event => "" );
       $p->handler( event => undef );
       $p->handler( event );
           This method assigns a subroutine, method, or array to handle an
           event.

           Event is one of "text", "start", "end", "declaration", "comment",
           "process", "start_document", "end_document" or "default".

           The "\&subroutine" is a reference to a subroutine which is called
           to handle the event.

           The $method_name is the name of a method of $p which is called to
           handle the event.

           The @accum is an array that will hold the event information as
           sub-arrays.

           If the second argument is "", the event is ignored.  If it is
           undef, the default handler is invoked for the event.

           The $argspec is a string that describes the information to be
           reported for the event.  Any requested information that does not
           apply to a specific event is passed as "undef".  If argspec is
           omitted, then it is left unchanged.

           The return value from $p->handler is the old callback routine or a
           reference to the accumulator array.

           Any return values from handler callback routines/methods are always
           ignored.  A handler callback can request parsing to be aborted by
           invoking the $p->eof method.  A handler callback is not allowed to
           invoke the $p->parse() or $p->parse_file() method.  An exception
           will be raised if it tries.

           Examples:

               $p->handler(start =>  "start", 'self, attr, attrseq, text' );

           This causes the "start" method of object $p to be called for
           'start' events.  The callback signature is $p->start(\%attr,
           \@attr_seq, $text).

               $p->handler(start =>  \&start, 'attr, attrseq, text' );

           This causes subroutine start() to be called for 'start' events.
           The callback signature is start(\%attr, \@attr_seq, $text).

               $p->handler(start =>  \@accum, '"S", attr, attrseq, text' );

           This causes 'start' event information to be saved in @accum.  The
           array elements will be ['S', \%attr, \@attr_seq, $text].

              $p->handler(start => "");

           This causes 'start' events to be ignored.  It also suppresses
           invocations of any default handler for start events.  It is in most
           cases equivalent to $p->handler(start => sub {}), but is more
           efficient.  It is different from the empty-sub-handler in that
           "skipped_text" is not reset by it.

              $p->handler(start => undef);

           This causes no handler to be associated with start events.  If
           there is a default handler it will be invoked.
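
           A runnable sketch of the accumulator form with literal strings
           in the argspec.  The input markup and the "S"/"E" markers are
           illustrative choices:

```perl
use strict;
use warnings;
use HTML::Parser ();

my @accum;
my $p = HTML::Parser->new(api_version => 3);

# Literal strings in the argspec tag each record with its origin.
$p->handler(start => \@accum, '"S", tagname, attr');
$p->handler(end   => \@accum, '"E", tagname');

$p->parse('<a href="x.html">link</a>');
$p->eof;

for my $rec (@accum) {
    my ($kind, $tagname, $attr) = @$rec;
    my $extra = $attr ? join(",", map { "$_=$attr->{$_}" } sort keys %$attr) : "";
    print "$kind $tagname $extra\n";
}
```

           Text events are not reported here because no "text" or "default"
           handler has been registered.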

       Filters based on tags can be set up to limit the number of events
       reported.  The main bottleneck during parsing is often the huge number
       of callbacks made from the parser.  Applying filters can improve
       performance significantly.

       The following methods control filters:

       $p->ignore_elements( @tags )
           Both the "start" event and the "end" event as well as any events
           that would be reported in between are suppressed.  The ignored
           elements can contain nested occurrences of themselves.  Example:

              $p->ignore_elements(qw(script style));

           The "script" and "style" tags will always nest properly since their
           content is parsed in CDATA mode.  For most other tags
           "ignore_elements" must be used with caution since HTML is often not
           well formed.

       $p->ignore_tags( @tags )
           Any "start" and "end" events involving any of the tags given are
           suppressed.  To reset the filter (i.e. don't suppress any "start"
           and "end" events), call "ignore_tags" without an argument.

       $p->report_tags( @tags )
           Any "start" and "end" events involving any of the tags not given
           are suppressed.  To reset the filter (i.e. report all "start" and
           "end" events), call "report_tags" without an argument.

       Internally, the system has two filter lists, one for "report_tags" and
       one for "ignore_tags", and both filters are applied.  This effectively
       gives "ignore_tags" precedence over "report_tags".

       Examples:

          $p->ignore_tags(qw(style));
          $p->report_tags(qw(script style));

       results in only "script" events being reported.
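
       As a sketch of a common use of these filters, the following collects
       only the text outside "script" and "style" elements.  The input
       document and the variable $visible are assumptions for illustration:

```perl
use strict;
use warnings;
use HTML::Parser ();

my $visible = "";
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [sub { $visible .= $_[0] }, "dtext"],
);

# Drop script and style elements together with everything inside them.
$p->ignore_elements(qw(script style));

$p->parse("<p>Hi</p><script>var x = 1;</script><style>p{}</style><p>there</p>");
$p->eof;

print "$visible\n";
```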

   Argspec
       Argspec is a string containing a comma-separated list that describes
       the information reported by the event.  The following argspec
       identifier names can be used:

       "attr"
           Attr causes a reference to a hash of attribute name/value pairs to
           be passed.

           Boolean attributes' values are either the value set by
           $p->boolean_attribute_value, or the attribute name if no value has
           been set by $p->boolean_attribute_value.

           This passes undef except for "start" events.

           Unless "xml_mode" or "case_sensitive" is enabled, the attribute
           names are forced to lower case.

           General entities are decoded in the attribute values and one layer
           of matching quotes enclosing the attribute values is removed.

           The Unicode character set is assumed for entity decoding.

       @attr
           Basically the same as "attr", but keys and values are passed as
           individual arguments and the original sequence of the attributes is
           kept.  The parameters passed will be the same as the @attr
           calculated here:

              @attr = map { $_ => $attr->{$_} } @$attrseq;

           assuming $attr and $attrseq here are the hash and array passed as
           the result of "attr" and "attrseq" argspecs.

           This passes no values for events besides "start".

       "attrseq"
           Attrseq causes a reference to an array of attribute names to be
           passed.  This can be useful if you want to walk the "attr" hash in
           the original sequence.

           This passes undef except for "start" events.

           Unless "xml_mode" or "case_sensitive" is enabled, the attribute
           names are forced to lower case.

       "column"
           Column causes the column number of the start of the event to be
           passed.  The first column on a line is 0.

       "dtext"
           Dtext causes the decoded text to be passed.  General entities are
           automatically decoded unless the event was inside a CDATA section
           or was between literal start and end tags ("script", "style",
           "xmp", "iframe", "title", "textarea" and "plaintext").

           The Unicode character set is assumed for entity decoding.  With
           Perl version 5.6 or earlier only the Latin-1 range is supported,
           and entities for characters outside the range 0..255 are left
           unchanged.

           This passes undef except for "text" events.

       "event"
           Event causes the event name to be passed.

           The event name is one of "text", "start", "end", "declaration",
           "comment", "process", "start_document" or "end_document".

       "is_cdata"
           Is_cdata causes a TRUE value to be passed if the event is inside a
           CDATA section or between literal start and end tags ("script",
           "style", "xmp", "iframe", "title", "textarea" and "plaintext").

           If the flag is FALSE for a text event, then you should normally
           either use "dtext" or decode the entities yourself before the text
           is processed further.

       "length"
           Length causes the number of bytes of the source text of the event
           to be passed.

       "line"
           Line causes the line number of the start of the event to be passed.
           The first line in the document is 1.  Line counting doesn't start
           until at least one handler requests this value to be reported.

       "offset"
           Offset causes the byte position in the HTML document of the start
           of the event to be passed.  The first byte in the document has
           offset 0.

       "offset_end"
           Offset_end causes the byte position in the HTML document of the end
           of the event to be passed.  This is the same as "offset" +
           "length".

       "self"
           Self causes the current object to be passed to the handler.  If the
           handler is a method, this must be the first element in the argspec.

           An alternative to passing self as an argspec is to register
           closures that capture $self by themselves as handlers.
           Unfortunately this creates circular references which prevent the
           HTML::Parser object from being garbage collected.  Using the "self"
           argspec avoids this problem.

       "skipped_text"
           Skipped_text returns the concatenated text of all the events that
           have been skipped since the last time an event was reported.
           Events might be skipped because no handler is registered for them
           or because some filter applies.  Skipped text also includes marked
           section markup, since there are no events that can catch it.

           If an ""-handler is registered for an event, then the text for this
           event is not included in "skipped_text".  Skipped text both before
           and after the ""-event is included in the next reported
           "skipped_text".

       "tag"
           Same as "tagname", but prefixed with "/" if it belongs to an "end"
           event and "!" for a declaration.  The "tag" does not have any
           prefix for "start" events, and is in this case identical to
           "tagname".

       "tagname"
           This is the element name (or generic identifier in SGML jargon) for
           start and end tags.  Since HTML is case insensitive, this name is
           forced to lower case to ease string matching.

           Since XML is case sensitive, the tagname case is not changed when
           "xml_mode" is enabled.  The same happens if the "case_sensitive"
           attribute is set.

           The declaration type of declaration elements is also passed as a
           tagname, even if that is a bit strange.  In fact, in the current
           implementation tagname is identical to "token0" except that the
           name may be forced to lower case.

       "token0"
           Token0 causes the original text of the first token string to be
           passed.  This should always be the same as $tokens->[0].

           For "declaration" events, this is the declaration type.

           For "start" and "end" events, this is the tag name.

           For "process" and non-strict "comment" events, this is everything
           inside the tag.

           This passes undef if there are no tokens in the event.

       "tokenpos"
           Tokenpos causes a reference to an array of token positions to be
           passed.  For each string that appears in "tokens", this array
           contains two numbers.  The first number is the offset of the start
           of the token in the original "text" and the second number is the
           length of the token.

           Boolean attributes in a "start" event will have (0,0) for the
           attribute value offset and length.

           This passes undef if there are no tokens in the event (e.g.,
           "text") and for artificial "end" events triggered by empty element
           tags.

           If you are using these offsets and lengths to modify "text", you
           should either work from right to left, or be very careful to
           calculate the changes to the offsets.
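
           A sketch of how "tokenpos" relates to "tokens" and "text" for a
           start tag without boolean attributes.  The input tag and the
           @pairs array are assumptions for illustration:

```perl
use strict;
use warnings;
use HTML::Parser ();

my @pairs;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [sub {
        my ($text, $tokens, $pos) = @_;
        for my $i (0 .. $#$tokens) {
            # Two numbers per token: offset into $text, then length.
            my ($offset, $length) = @$pos[2 * $i, 2 * $i + 1];
            push @pairs, [ $tokens->[$i], substr($text, $offset, $length) ];
        }
    }, "text,tokens,tokenpos"],
);

$p->parse('<a href="x.html" id=top>');
$p->eof;

# Print each token next to the slice of the source text it came from.
printf "%-10s %s\n", @$_ for @pairs;
```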

       "tokens"
           Tokens causes a reference to an array of token strings to be
           passed.  The strings are exactly as they were found in the original
           text, no decoding or case changes are applied.

           For "declaration" events, the array contains each word, comment,
           and delimited string starting with the declaration type.

           For "comment" events, this contains each sub-comment.  If
           $p->strict_comment is disabled, there will be only one
           sub-comment.

           For "start" events, this contains the original tag name followed by
           the attribute name/value pairs.  The values of boolean attributes
           will be either the value set by $p->boolean_attribute_value, or the
           attribute name if no value has been set by
           $p->boolean_attribute_value.

           For "end" events, this contains the original tag name (always one
           token).

           For "process" events, this contains the processing instructions
           (always one token).

           This passes "undef" for "text" events.

       "text"
           Text causes the source text (including markup element delimiters)
           to be passed.

       "undef"
           Pass an undefined value.  Useful as padding where the same handler
           routine is registered for multiple events.

       '...'
           A literal string of 0 to 255 characters enclosed in single (') or
           double (") quotes is passed as entered.

       The whole argspec string can be wrapped up in '@{...}' to signal that
       the resulting event array should be flattened.  This only makes a
       difference if an array reference is used as the handler target.
       Consider this example:

          $p->handler(text => [], 'text');
          $p->handler(text => [], '@{text}');

       With two text events; "foo", "bar"; then the first example will end up
       with [["foo"], ["bar"]] and the second with ["foo", "bar"] in the
       handler target array.
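
       The difference can be observed with two parsers fed the same input.
       In this sketch the input string and the array names @nested and @flat
       are illustrative assumptions:

```perl
use strict;
use warnings;
use HTML::Parser ();

my (@nested, @flat);

my $p1 = HTML::Parser->new(api_version => 3);
$p1->handler(text => \@nested, 'text');      # one sub-array per event

my $p2 = HTML::Parser->new(api_version => 3);
$p2->handler(text => \@flat, '@{text}');     # values pushed directly

for my $p ($p1, $p2) {
    $p->parse("foo<br>bar");
    $p->eof;
}

print "nested: ", join(", ", map { "[" . join(",", @$_) . "]" } @nested), "\n";
print "flat:   ", join(", ", @flat), "\n";
```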
660
661   Events
662       Handlers for the following events can be registered:
663
664       "comment"
665           This event is triggered when a markup comment is recognized.
666
667           Example:
668
669             <!-- This is a comment -- -- So is this -->
670
671       "declaration"
672           This event is triggered when a markup declaration is recognized.
673
674           For typical HTML documents, the only declaration you are likely to
675           find is <!DOCTYPE ...>.
676
677           Example:
678
679             <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
680                 "http://www.w3.org/TR/html4/strict.dtd">
681
682           DTDs inside <!DOCTYPE ...> will confuse HTML::Parser.
683
684       "default"
685           This event is triggered for events that do not have a specific
686           handler.  You can set up a handler for this event to catch stuff
687           you did not want to catch explicitly.
688
689       "end"
690           This event is triggered when an end tag is recognized.
691
692           Example:
693
694             </A>
695
696       "end_document"
697           This event is triggered when $p->eof is called and after any
698           remaining text is flushed.  There is no document text associated
699           with this event.
700
701       "process"
702           This event is triggered when processing instruction markup is
703           recognized.
704
705           The format and content of processing instructions are system and
706           application dependent.
707
708           Examples:
709
710             <? HTML processing instructions >
711             <? XML processing instructions ?>
712
713       "start"
714           This event is triggered when a start tag is recognized.
715
716           Example:
717
718             <A HREF="http://www.perl.com/">
719
720       "start_document"
721           This event is triggered before any other events for a new document.
722           A handler for it can be used to initialize stuff.  There is no
723           document text associated with this event.
724
725       "text"
726           This event is triggered when plain text (characters) is recognized.
727           The text may contain multiple lines.  A sequence of text may be
728           broken between several text events unless $p->unbroken_text is
729           enabled.
730
731           The parser will make sure that it does not break a word or a
732           sequence of whitespace between two text events.
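
       To see the order in which these events arrive, a sketch along the
       following lines can be used.  It relies on the "event" argspec,
       which passes the event name; the variable @seen and the sample
       document are assumptions made for this example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @seen;
my $p = HTML::Parser->new(api_version => 3);

# Register the same recording handler for each event of interest.
for my $event (qw(start_document declaration comment
                  start text end end_document)) {
    $p->handler($event => sub { push @seen, shift }, 'event');
}

$p->parse('<!DOCTYPE html><!-- note --><p>Hi</p>');
$p->eof;    # flushes pending text and triggers end_document

print join(", ", @seen), "\n";
```

       Events without a registered handler, such as "process" here, are
       simply skipped unless a "default" handler is set up.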
733
734   Unicode
735       "HTML::Parser" can parse Unicode strings when running under perl-5.8 or
736       better.  If Unicode is passed to $p->parse() then chunks of Unicode
737       will be reported to the handlers.  The offset and length argspecs will
738       also report their position in terms of characters.
739
740       It is safe to parse raw undecoded UTF-8 if you either avoid decoding
741       entities and make sure to not use argspecs that do, or enable the
742       "utf8_mode" for the parser.  Parsing of undecoded UTF-8 might be useful
743       when parsing from a file where you need the reported offsets and
744       lengths to match the byte offsets in the file.
745
746       If a filename is passed to $p->parse_file() then the file will be read
747       in binary mode.  This will be fine if the file contains only ASCII or
748       Latin-1 characters.  If the file contains UTF-8 encoded text then care
749       must be taken when decoding entities as described in the previous
750       paragraph.  It is better to open the file with the UTF-8 layer so
751       that it is decoded properly:
752
          open(my $fh, "<:utf8", "index.html") || die "...: $!";
          $p->parse_file($fh);
755
756       If the file contains text encoded in a charset besides ASCII, Latin-1
757       or UTF-8 then decoding will always be needed.
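
       When the data is already in memory rather than in a file, the same
       effect can be had by decoding it before it reaches the parser.  The
       sketch below assumes the input is UTF-8 and uses the core Encode
       module; for another charset, pass that charset's name to decode()
       instead.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Parser ();

# Raw UTF-8 bytes: an e-acute as two bytes, followed by an entity.
my $bytes = "<p>caf\xC3\xA9 &eacute;</p>";

# Decode to Perl characters before parsing.
my $chars = decode('UTF-8', $bytes);

my $text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [sub { $text .= shift }, 'dtext'],
);
$p->parse($chars);
$p->eof;

# $text now holds characters; the literal and the entity both
# became U+00E9.
printf "%d characters\n", length $text;
```

       Since the parser sees characters here, any offsets and lengths it
       reports count characters rather than bytes.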
758

VERSION 2 COMPATIBILITY

760       When an "HTML::Parser" object is constructed with no arguments, a set
761       of handlers is automatically provided that is compatible with the old
762       HTML::Parser version 2 callback methods.
763
764       This is equivalent to the following method calls:
765
          $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
          $p->handler(end     => "end",     "self, tagname, text");
          $p->handler(text    => "text",    "self, text, is_cdata");
          $p->handler(process => "process", "self, token0, text");
          $p->handler(comment =>
                    sub {
                        my($self, $tokens) = @_;
                        # one comment() call per comment token
                        for (@$tokens) { $self->comment($_); }
                    },
                    "self, tokens");
          $p->handler(declaration =>
                    sub {
                        my $self = shift;
                        # strip the leading "<!" and the trailing ">"
                        $self->declaration(substr($_[0], 2, -1));
                    },
                    "self, text");
780
781       Setting up these handlers can also be requested with the "api_version
782       => 2" constructor option.
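
       A version 2 style subclass overrides the callback methods named
       above.  The following is a minimal sketch; the package name
       "TagCounter" and the hash keys "tag_count" and "text_length" are
       invented for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

{
    package TagCounter;              # hypothetical subclass name
    our @ISA = ('HTML::Parser');

    # Version 2 callbacks are ordinary methods on the parser object.
    sub start {
        my ($self, $tagname, $attr, $attrseq, $text) = @_;
        $self->{tag_count}{$tagname}++;
    }

    sub text {
        my ($self, $text, $is_cdata) = @_;
        $self->{text_length} += length $text;
    }
}

# No constructor arguments, so the version 2 handlers are installed.
my $p = TagCounter->new;
$p->parse('<p>Hello <b>world</b></p>');
$p->eof;

printf "%s: %d\n", $_, $p->{tag_count}{$_} for sort keys %{ $p->{tag_count} };
printf "text length: %d\n", $p->{text_length};
```

       Callbacks that the subclass does not override, such as end() in
       this sketch, fall back to the base class methods.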
783

SUBCLASSING

785       The "HTML::Parser" class is subclassable.  Parser objects are plain
786       hashes and "HTML::Parser" reserves only hash keys that start with
787       "_hparser".  The parser state can be set up by invoking the init()
788       method, which takes the same arguments as new().
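
       One way to use init() is from a constructor that blesses its own
       hash first.  This is a sketch under that assumption; the package
       name "LinkCollector", the method names and the "links" key are
       hypothetical and only avoid the reserved "_hparser" prefix.

```perl
use strict;
use warnings;
use HTML::Parser ();

{
    package LinkCollector;           # hypothetical subclass name
    our @ISA = ('HTML::Parser');

    sub new {
        my ($class, %opt) = @_;
        # Own state lives in ordinary hash keys.
        my $self = bless { links => [] }, $class;
        # init() takes the same arguments as HTML::Parser->new.
        $self->init(api_version => 3,
                    start_h     => ['on_start', 'self, tagname, attr'],
                    %opt);
        return $self;
    }

    # Called as a method for every start tag.
    sub on_start {
        my ($self, $tagname, $attr) = @_;
        push @{ $self->{links} }, $attr->{href}
            if $tagname eq 'a' && defined $attr->{href};
    }

    sub links { @{ $_[0]{links} } }
}

my $p = LinkCollector->new;
$p->parse('<a href="/x">x</a><a name="n">n</a><A HREF="http://example.com/">e</A>');
$p->eof;
print "$_\n" for $p->links;
```

       Naming the handler as a method ("on_start") rather than as a code
       reference lets further subclasses override it.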
789

EXAMPLES

791       The first simple example shows how you might strip out comments from an
792       HTML document.  We achieve this by setting up a comment handler that
793       does nothing and a default handler that will print out anything else:
794
         use HTML::Parser;
         # print the source text of every event; comments get an empty handler
         HTML::Parser->new(default_h => [sub { print shift }, 'text'],
                           comment_h => [""],
                          )->parse_file(shift || die) || die $!;
799
800       An alternative implementation is:
801
         use HTML::Parser;
         # a single end_document callback prints the accumulated skipped text
         HTML::Parser->new(end_document_h => [sub { print shift },
                                              'skipped_text'],
                           comment_h      => [""],
                          )->parse_file(shift || die) || die $!;
807
808       This will in most cases be much more efficient since only a single
809       callback will be made.
810
811       The next example prints out the text that is inside the <title> element
812       of an HTML document.  Here we start by setting up a start handler.
813       When it sees the title start tag it enables a text handler that prints
814       any text found and an end handler that will terminate parsing as soon
815       as the title end tag is seen:
816
         use HTML::Parser ();

         sub start_handler
         {
           return if shift ne "title";      # ignore every tag but <title>
           my $self = shift;
           # from here on, print decoded text and stop at </title>
           $self->handler(text => sub { print shift }, "dtext");
           $self->handler(end  => sub { shift->eof if shift eq "title"; },
                                  "tagname,self");
         }

         my $p = HTML::Parser->new(api_version => 3);
         $p->handler( start => \&start_handler, "tagname,self");
         $p->parse_file(shift || die) || die $!;
         print "\n";
832
833       More examples are found in the eg/ directory of the "HTML-Parser"
834       distribution: the program "hrefsub" shows how you can edit all links
835       found in a document; the program "htextsub" shows how to edit the text
836       only; the program "hstrip" shows how you can strip out certain
837       tags/elements and/or attributes; and the program "htext" shows how to
838       obtain the plain text, but not any script/style content.
839
840       You can browse the eg/ directory online from the [Browse] link on the
841       http://search.cpan.org/~gaas/HTML-Parser/ page.
842

BUGS

844       The <style> and <script> sections do not end with the first "</", but
845       need the complete corresponding end tag.  The standard behaviour is not
846       really practical.
847
848       When the strict_comment option is enabled, we still recognize comments
849       where there is something other than whitespace between even and odd
850       "--" markers.
851
852       Once $p->boolean_attribute_value has been set, there is no way to
853       restore the default behaviour.
854
855       There is currently no way to get both quote characters into the same
856       literal argspec.
857
858       Empty tags, e.g. "<>" and "</>", are not recognized.  SGML allows them
859       to repeat the previous start tag or close the previous start tag
860       respectively.
861
862       NET tags, e.g. "<code/.../", are not recognized.  This is SGML shorthand
863       for "<code>...</code>".
864
865       Unclosed start or end tags, e.g. "<tt<b>...</b</tt>" are not
866       recognized.
867

DIAGNOSTICS

869       The following messages may be produced by HTML::Parser.  The notation
870       in this listing is the same as used in perldiag:
871
872       Not a reference to a hash
873           (F) The object blessed into or subclassed from HTML::Parser is not
874           a hash as required by the HTML::Parser methods.
875
876       Bad signature in parser state object at %p
877           (F) The _hparser_xs_state element does not refer to a valid state
878           structure.  Something must have changed the internal value stored
879           in this hash element, or the memory has been overwritten.
880
881       _hparser_xs_state element is not a reference
882           (F) The _hparser_xs_state element has been destroyed.
883
884       Can't find '_hparser_xs_state' element in HTML::Parser hash
885           (F) The _hparser_xs_state element is missing from the parser hash.
886           It was either deleted, or not created when the object was created.
887
888       API version %s not supported by HTML::Parser %s
889           (F) The constructor option 'api_version' with an argument greater
890           than or equal to 4 is reserved for future extensions.
891
892       Bad constructor option '%s'
893           (F) An unknown constructor option key was passed to the new() or
894           init() methods.
895
896       Parse loop not allowed
897           (F) A handler invoked the parse() or parse_file() method.  This is
898           not permitted.
899
900       marked sections not supported
901           (F) The $p->marked_sections() method was invoked in an HTML::Parser
902           module that was compiled without support for marked sections.
903
904       Unknown boolean attribute (%d)
905           (F) Something is wrong with the internal logic that set up aliases
906           for boolean attributes.
907
908       Only code or array references allowed as handler
909           (F) The second argument for $p->handler must be either a subroutine
910           reference, then name of a subroutine or method, or a reference to
911           an array.
912
913       No handler for %s events
914           (F) The first argument to $p->handler must be a valid event name;
915           i.e. one of "start", "end", "text", "process", "declaration" or
916           "comment".
917
918       Unrecognized identifier %s in argspec
919           (F) The identifier is not a known argspec name.  Use one of the
920           names mentioned in the argspec section above.
921
922       Literal string is longer than 255 chars in argspec
923           (F) The current implementation limits the length of literals in an
924           argspec to 255 characters.  Make the literal shorter.
925
926       Backslash reserved for literal string in argspec
927           (F) The backslash character "\" is not allowed in argspec literals.
928           It is reserved to permit quoting inside a literal in a later
929           version.
930
931       Unterminated literal string in argspec
932           (F) The terminating quote character for a literal was not found.
933
934       Bad argspec (%s)
935           (F) Only identifier names, literals, spaces and commas are allowed
936           in argspecs.
937
938       Missing comma separator in argspec
939           (F) Identifiers in an argspec must be separated with ",".
940
941       Parsing of undecoded UTF-8 will give garbage when decoding entities
942           (W) The first chunk parsed appears to contain undecoded UTF-8 and
943           one or more argspecs that decode entities are used for the callback
944           handlers.
945
946           The result of decoding will be a mix of encoded and decoded
947           characters for any entities that expand to characters with code
948           above 127.  This is not a good thing.
949
950           The recommended solution is to apply Encode::decode_utf8() on the
951           data before feeding it to the $p->parse().  For $p->parse_file()
952           pass a file that has been opened in ":utf8" mode.
953
954           The alternative solution is to enable the "utf8_mode" and not
955           decode before passing strings to $p->parse().  The parser can
956           process raw undecoded UTF-8 sanely if the "utf8_mode" is enabled,
957           or if the "attr", "@attr" or "dtext" argspecs are avoided.
958
959       Parsing string decoded with wrong endianness
960           (W) The first character in the document is U+FFFE.  This is not a
961           legal Unicode character but a byte swapped BOM.  The result of
962           parsing will likely be garbage.
963
964       Parsing of undecoded UTF-32
965           (W) The parser found the Unicode UTF-32 BOM signature at the start
966           of the document.  The result of parsing will likely be garbage.
967
968       Parsing of undecoded UTF-16
969           (W) The parser found the Unicode UTF-16 BOM signature at the start
970           of the document.  The result of parsing will likely be garbage.
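
       Since the (F) messages are fatal, a program that builds argspecs or
       options at run time may want to trap them with eval.  The sketch
       below provokes the "Unrecognized identifier" message on purpose;
       the argspec name "no_such_name" is deliberately invalid.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);

my $err = '';
eval {
    # Deliberately invalid argspec name; handler() dies here.
    $p->handler(text => sub { }, 'no_such_name');
    1;
} or $err = $@;

print "handler() failed: $err" if $err;
```

       The parser object remains usable after such a failure, so a
       corrected call to handler() can follow.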
971

COPYRIGHT

984        Copyright 1996-2016 Gisle Aas. All rights reserved.
985        Copyright 1999-2000 Michael A. Chase.  All rights reserved.
986
987       This library is free software; you can redistribute it and/or modify it
988       under the same terms as Perl itself.
989
990
991
992perl v5.30.1                      2020-02-04                         Parser(3)