Parser(3)             User Contributed Perl Documentation            Parser(3)

NAME
    HTML::Parser - HTML parser class

SYNOPSIS
        use HTML::Parser ();

        # Create parser object
        $p = HTML::Parser->new( api_version => 3,
                                start_h => [\&start, "tagname, attr"],
                                end_h   => [\&end,   "tagname"],
                                marked_sections => 1,
                              );

        # Parse document text chunk by chunk
        $p->parse($chunk1);
        $p->parse($chunk2);
        #...
        $p->eof;                 # signal end of document

        # Parse directly from file
        $p->parse_file("foo.html");
        # or
        open(my $fh, "<:utf8", "foo.html") || die;
        $p->parse_file($fh);

DESCRIPTION
    Objects of the "HTML::Parser" class will recognize markup and separate
    it from plain text (alias data content) in HTML documents. As
    different kinds of markup and text are recognized, the corresponding
    event handlers are invoked.

    "HTML::Parser" is not a generic SGML parser. We have tried to make it
    able to deal with the HTML that is actually "out there", and it
    normally parses as closely as possible to the way the popular web
    browsers do it instead of strictly following one of the many HTML
    specifications from W3C. Where there is disagreement, there is often
    an option that you can enable to get the official behaviour.

    The document to be parsed may be supplied in arbitrary chunks. This
    makes on-the-fly parsing as documents are received from the network
    possible.

    If event driven parsing does not feel right for your application, you
    might want to use "HTML::PullParser". This is an "HTML::Parser"
    subclass that allows a more conventional program structure.

METHODS
    The following method is used to construct a new "HTML::Parser" object:

    $p = HTML::Parser->new( %options_and_handlers )
        This class method creates a new "HTML::Parser" object and returns
        it. Key/value argument pairs may be provided to assign event
        handlers or initialize parser options. The handlers and parser
        options can also be set or modified later by the method calls
        described below.

        If a top level key is in the form "<event>_h" (e.g., "text_h") then
        it assigns a handler to that event, otherwise it initializes a
        parser option. The event handler specification value must be an
        array reference. Multiple handlers may also be assigned with the
        'handlers => [%handlers]' option. See examples below.

        If new() is called without any arguments, it will create a parser
        that uses callback methods compatible with version 2 of
        "HTML::Parser". See the section on "version 2 compatibility" below
        for details.

        The special constructor option 'api_version => 2' can be used to
        initialize version 2 callbacks while still setting other options
        and handlers. The 'api_version => 3' option can be used if you
        don't want to set any options and don't want to fall back to v2
        compatible mode.

        Examples:

            $p = HTML::Parser->new(api_version => 3,
                                   text_h => [ sub {...}, "dtext" ]);

        This creates a new parser object with a text event handler
        subroutine that receives the original text with general entities
        decoded.

            $p = HTML::Parser->new(api_version => 3,
                                   start_h => [ 'my_start', "self,tokens" ]);

        This creates a new parser object with a start event handler method
        that receives the $p and the tokens array.

            $p = HTML::Parser->new(api_version => 3,
                                   handlers => { text    => [\@array, "event,text"],
                                                 comment => [\@array, "event,text"],
                                               });

        This creates a new parser object that stores the event type and the
        original text in @array for text and comment events.
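
    As a complete illustration of the constructor forms above, the
    following minimal sketch passes two handlers and one parser option to
    new() and then parses a short string. The @seen array and the sample
    markup are assumptions made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @seen;    # collected event summaries (illustrative name)

my $p = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,    # parser option: report each text run in one piece
    start_h       => [ sub { push @seen, "start:$_[0]" }, "tagname" ],
    text_h        => [ sub { push @seen, "text:$_[0]"  }, "dtext"   ],
);

$p->parse("<p>Fish &amp; chips</p>");
$p->eof;

# No end handler is registered, so the </p> tag is not reported.
print "$_\n" for @seen;
```

    Because the text handler asks for "dtext", the entity in the sample
    is delivered to the handler already decoded.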

    The following methods feed the HTML document to the "HTML::Parser"
    object:

    $p->parse( $string )
        Parse $string as the next chunk of the HTML document. Handlers
        invoked should not attempt to modify the $string in-place until
        $p->parse returns.

        If an invoked event handler aborts parsing by calling $p->eof, then
        $p->parse() will return a FALSE value. Otherwise the return value
        is a reference to the parser object ($p).

    $p->parse( $code_ref )
        If a code reference is passed as the argument to be parsed, then
        the chunks to be parsed are obtained by invoking this function
        repeatedly. Parsing continues until the function returns an empty
        (or undefined) result. When this happens $p->eof is automatically
        signaled.

        Parsing will also abort if one of the event handlers calls $p->eof.

        The effect of this is the same as:

            while (1) {
                my $chunk = &$code_ref();
                if (!defined($chunk) || !length($chunk)) {
                    $p->eof;
                    return $p;
                }
                $p->parse($chunk) || return undef;
            }

        But it is more efficient as this loop runs internally in XS code.

    $p->parse_file( $file )
        Parse text directly from a file. The $file argument can be a
        filename, an open file handle, or a reference to an open file
        handle.

        If $file contains a filename and the file can't be opened, then the
        method returns an undefined value and $! tells why it failed.
        Otherwise the return value is a reference to the parser object.

        If a file handle is passed as the $file argument, then the file
        will normally be read until EOF, but not closed.

        If an invoked event handler aborts parsing by calling $p->eof, then
        $p->parse_file() may not have read the entire file.

        On systems with multi-byte line terminators, the values passed for
        the offset and length argspecs may be too low if parse_file() is
        called on a file handle that is not in binary mode.

        If a filename is passed in, then parse_file() will open the file in
        binary mode.

    $p->eof
        Signals the end of the HTML document. Calling the $p->eof method
        outside a handler callback will flush any remaining buffered text
        (which triggers the "text" event if there is any remaining text).

        Calling $p->eof inside a handler will terminate parsing at that
        point and cause $p->parse to return a FALSE value. This also
        terminates parsing by $p->parse_file().

        After $p->eof has been called, the parse() and parse_file() methods
        can be invoked to feed new documents with the parser object.

        The return value from eof() is a reference to the parser object.
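
    The two less obvious behaviours described above, feeding chunks
    through a code reference and aborting from inside a handler, can be
    sketched as follows. The chunk list, the variable names and the
    sample markup are assumptions made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

# 1. Feed the document through a code reference, one chunk per call.
my @chunks = ("<p>Hel", "lo, wor", "ld</p>");
my $text   = "";
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $text .= shift }, "dtext" ],
);
# The generator returns undef once @chunks is empty, which ends the document.
my $ret = $p->parse( sub { shift @chunks } );
print "text: $text\n";

# 2. Abort from inside a handler by calling eof on the parser.
my @tags;
my $q = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($self, $tagname) = @_;
            push @tags, $tagname;
            $self->eof if $tagname eq "body";    # stop at <body>
        },
        "self,tagname",
    ],
);
my $aborted = $q->parse("<html><head></head><body><p>ignored</p></body></html>");
print "tags: @tags\n";
print "parse returned ", ($aborted ? "the parser" : "a false value"), "\n";
```

    The text may reach the handler in several pieces when it is split
    across chunks, which is why the sketch concatenates it.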

    Most parser options are controlled by boolean attributes. Each boolean
    attribute is enabled by calling the corresponding method with a TRUE
    argument and disabled with a FALSE argument. The attribute value is
    left unchanged if no argument is given. The return value from each
    method is the old attribute value.

    Methods that can be used to get and/or set parser options are:

    $p->attr_encoded
    $p->attr_encoded( $bool )
        By default, the "attr" and @attr argspecs will have general
        entities for attribute values decoded. Enabling this attribute
        leaves entities alone.

    $p->backquote
    $p->backquote( $bool )
        By default, only ' and " are recognized as quote characters around
        attribute values. MSIE also recognizes backquotes for some reason.
        Enabling this attribute provides compatibility with this behaviour.

    $p->boolean_attribute_value( $val )
        This method sets the value reported for boolean attributes inside
        HTML start tags. By default, the name of the attribute is also
        used as its value. This affects the values reported for "tokens"
        and "attr" argspecs.

    $p->case_sensitive
    $p->case_sensitive( $bool )
        By default, tagnames and attribute names are down-cased. Enabling
        this attribute leaves them as found in the HTML source document.

    $p->closing_plaintext
    $p->closing_plaintext( $bool )
        By default, the "plaintext" element can never be closed. Everything
        up to the end of the document is parsed in CDATA mode. This
        historical behaviour is what at least MSIE does. Enabling this
        attribute makes a closing "</plaintext>" tag effective and the
        parsing process will resume after seeing this tag. This emulates
        early gecko-based browsers.

    $p->empty_element_tags
    $p->empty_element_tags( $bool )
        By default, empty element tags are not recognized as such and the
        "/" before ">" is just treated like a normal name character (unless
        "strict_names" is enabled). Enabling this attribute makes
        "HTML::Parser" recognize these tags.

        Empty element tags look like start tags, but end with the character
        sequence "/>" instead of ">". When recognized by "HTML::Parser"
        they cause an artificial end event in addition to the start event.
        The "text" for the artificial end event will be empty and the
        "tokenpos" array will be undefined even though the token array will
        have one element containing the tag name.

    $p->marked_sections
    $p->marked_sections( $bool )
        By default, section markings like <![CDATA[...]]> are treated like
        ordinary text. When this attribute is enabled section markings are
        honoured.

        There are currently no events associated with the marked section
        markup, but the text can be returned as "skipped_text".

    $p->strict_comment
    $p->strict_comment( $bool )
        By default, comments are terminated by the first occurrence of
        "-->". This is the behaviour of most popular browsers (like
        Mozilla, Opera and MSIE), but it is not correct according to the
        official HTML standard. Officially, you need an even number of
        "--" tokens before the closing ">" is recognized and there may not
        be anything but whitespace between an even and an odd "--".

        The official behaviour is enabled by enabling this attribute.

        Enabling of 'strict_comment' also disables recognizing these forms
        as comments:

            </ comment>
            <! comment>

    $p->strict_end
    $p->strict_end( $bool )
        By default, attributes and other junk are allowed to be present on
        end tags in a manner that emulates MSIE's behaviour.

        The official behaviour is enabled with this attribute. If enabled,
        only whitespace is allowed between the tagname and the final ">".

    $p->strict_names
    $p->strict_names( $bool )
        By default, almost anything is allowed in tag and attribute names.
        This is the behaviour of most popular browsers and allows us to
        parse some broken tags with invalid attribute values like:

            <IMG SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>

        By default, "LIST]" is parsed as a boolean attribute, not as part
        of the ALT value as was clearly intended. This is also what
        Mozilla sees.

        The official behaviour is enabled by enabling this attribute. If
        enabled, it will cause the tag above to be reported as text since
        "LIST]" is not a legal attribute name.

    $p->unbroken_text
    $p->unbroken_text( $bool )
        By default, blocks of text are given to the text handler as soon as
        possible (but the parser takes care always to break text at a
        boundary between whitespace and non-whitespace so single words and
        entities can always be decoded safely). This might create breaks
        that make it hard to do transformations on the text. When this
        attribute is enabled, blocks of text are always reported in one
        piece. This will delay the text event until the following
        (non-text) event has been recognized by the parser.

        Note that the "offset" argspec will give you the offset of the
        first segment of text and "length" is the combined length of the
        segments. Since there might be ignored tags in between, these
        numbers can't be used to directly index in the original document
        file.

    $p->utf8_mode
    $p->utf8_mode( $bool )
        Enable this option when parsing raw undecoded UTF-8. This tells
        the parser that the entities expanded for strings reported by
        "attr", @attr and "dtext" should be expanded as decoded UTF-8 so
        they end up compatible with the surrounding text.

        If "utf8_mode" is enabled then it is an error to pass strings
        containing characters with code above 255 to the parse() method,
        and the parse() method will croak if you try.

        Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when
        UTF-8 encoded. The character can also be represented by the entity
        "&hearts;" or "&#x2665;". If we feed the parser:

            $p->parse("\xE2\x99\xA5&hearts;");

        then "dtext" will be reported as "\xE2\x99\xA5\x{2665}" without
        "utf8_mode" enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when
        enabled. The latter string is what you want.

        This option is only available with perl-5.8 or better.

    $p->xml_mode
    $p->xml_mode( $bool )
        Enabling this attribute changes the parser to allow some XML
        constructs. This enables the behaviour controlled individually by
        the "case_sensitive", "empty_element_tags", "strict_names" and
        "xml_pic" attributes and also suppresses special treatment of
        elements that are parsed as CDATA for HTML.

    $p->xml_pic
    $p->xml_pic( $bool )
        By default, processing instructions are terminated by ">". When
        this attribute is enabled, processing instructions are terminated
        by "?>" instead.
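
    The get/set convention shared by the boolean option methods, and the
    artificial end event produced by "empty_element_tags", can be seen
    together in a small sketch. The @events array and the one-tag input
    are assumptions made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;
my $p = HTML::Parser->new(api_version => 3);
$p->handler($_ => \@events, "event,tagname") for qw(start end);

# Setting an option returns the previous value; calling the method
# without an argument only queries the current value.
my $before = $p->empty_element_tags(1);    # previous value: disabled
my $after  = $p->empty_element_tags;       # current value: enabled

$p->parse("<br/>");
$p->eof;

# With the option enabled, <br/> yields a start event and an artificial
# end event for the same tag name.
print join(" ", map { "$_->[0]:$_->[1]" } @events), "\n";
```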

    As markup and text are recognized, handlers are invoked. The following
    method is used to set up handlers for different events:

    $p->handler( event => \&subroutine, $argspec )
    $p->handler( event => $method_name, $argspec )
    $p->handler( event => \@accum, $argspec )
    $p->handler( event => "" );
    $p->handler( event => undef );
    $p->handler( event );
        This method assigns a subroutine, method, or array to handle an
        event.

        Event is one of "text", "start", "end", "declaration", "comment",
        "process", "start_document", "end_document" or "default".

        The "\&subroutine" is a reference to a subroutine which is called
        to handle the event.

        The $method_name is the name of a method of $p which is called to
        handle the event.

        The @accum is an array that will hold the event information as
        sub-arrays.

        If the second argument is "", the event is ignored. If it is
        undef, the default handler is invoked for the event.

        The $argspec is a string that describes the information to be
        reported for the event. Any requested information that does not
        apply to a specific event is passed as "undef". If argspec is
        omitted, then it is left unchanged.

        The return value from $p->handler is the old callback routine or a
        reference to the accumulator array.

        Any return values from handler callback routines/methods are always
        ignored. A handler callback can request parsing to be aborted by
        invoking the $p->eof method. A handler callback is not allowed to
        invoke the $p->parse() or $p->parse_file() method. An exception
        will be raised if it tries.

        Examples:

            $p->handler(start => "start", 'self, attr, attrseq, text' );

        This causes the "start" method of object $p to be called for
        'start' events. The callback signature is $p->start(\%attr,
        \@attr_seq, $text).

            $p->handler(start => \&start, 'attr, attrseq, text' );

        This causes subroutine start() to be called for 'start' events.
        The callback signature is start(\%attr, \@attr_seq, $text).

            $p->handler(start => \@accum, '"S", attr, attrseq, text' );

        This causes 'start' event information to be saved in @accum. The
        array elements will be ['S', \%attr, \@attr_seq, $text].

            $p->handler(start => "");

        This causes 'start' events to be ignored. It also suppresses
        invocations of any default handler for start events. It is in most
        cases equivalent to $p->handler(start => sub {}), but is more
        efficient. It is different from the empty-sub-handler in that
        "skipped_text" is not reset by it.

            $p->handler(start => undef);

        This causes no handler to be associated with start events. If
        there is a default handler it will be invoked.
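
    The accumulator form is the least familiar of the handler targets
    above, so here is a minimal runnable sketch of it. The literal "S"
    and "E" markers, the @accum name and the sample link are assumptions
    made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @accum;    # one sub-array is appended per reported event
my $p = HTML::Parser->new(api_version => 3);
$p->handler(start => \@accum, '"S", tagname, attr');
$p->handler(end   => \@accum, '"E", tagname');
$p->handler(text  => "");    # ignore text events

$p->parse('<a href="index.html">Home</a>');
$p->eof;

for my $ev (@accum) {
    my ($kind, $tagname, $attr) = @$ev;
    print "$kind $tagname", ($attr ? " href=$attr->{href}" : ""), "\n";
}
```

    Collecting events this way avoids a Perl-level callback per event;
    the array can be processed after parsing has finished.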

    Filters based on tags can be set up to limit the number of events
    reported. The main bottleneck during parsing is often the huge number
    of callbacks made from the parser. Applying filters can improve
    performance significantly.

    The following methods control filters:

    $p->ignore_elements( @tags )
        Both the "start" event and the "end" event as well as any events
        that would be reported in between are suppressed. The ignored
        elements can contain nested occurrences of themselves. Example:

            $p->ignore_elements(qw(script style));

        The "script" and "style" tags will always nest properly since their
        content is parsed in CDATA mode. For most other tags
        "ignore_elements" must be used with caution since HTML is often not
        well formed.

    $p->ignore_tags( @tags )
        Any "start" and "end" events involving any of the tags given are
        suppressed. To reset the filter (i.e. don't suppress any "start"
        and "end" events), call "ignore_tags" without an argument.

    $p->report_tags( @tags )
        Any "start" and "end" events involving any of the tags not given
        are suppressed. To reset the filter (i.e. report all "start" and
        "end" events), call "report_tags" without an argument.

    Internally, the system has two filter lists, one for "report_tags" and
    one for "ignore_tags", and both filters are applied. This effectively
    gives "ignore_tags" precedence over "report_tags".

    Examples:

        $p->ignore_tags(qw(style));
        $p->report_tags(qw(script style));

    results in only "script" events being reported.
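
    A slightly larger sketch shows the filter methods working together:
    "ignore_elements" drops a script element including its content, while
    "report_tags" limits start events to the listed tags. The variable
    names and the sample markup are assumptions made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @tags;
my $text = "";
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @tags, shift }, "tagname" ],
    text_h      => [ sub { $text .= shift },    "dtext"   ],
);
$p->ignore_elements(qw(script style));    # drop these elements and their content
$p->report_tags(qw(p a));                 # report start/end only for <p> and <a>

$p->parse('<div><p>Hi <a href="/">there</a></p>'
        . '<script>var s = "<p>";</script></div>');
$p->eof;

print "tags: @tags\n";
print "text: $text\n";
```

    The tag filters only affect start and end events, so the text inside
    the unreported div element is still delivered.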

  Argspec
    Argspec is a string containing a comma-separated list that describes
    the information reported by the event. The following argspec
    identifier names can be used:

    "attr"
        Attr causes a reference to a hash of attribute name/value pairs to
        be passed.

        Boolean attributes' values are either the value set by
        $p->boolean_attribute_value, or the attribute name if no value has
        been set by $p->boolean_attribute_value.

        This passes undef except for "start" events.

        Unless "xml_mode" or "case_sensitive" is enabled, the attribute
        names are forced to lower case.

        General entities are decoded in the attribute values and one layer
        of matching quotes enclosing the attribute values is removed.

        The Unicode character set is assumed for entity decoding.

    @attr
        Basically the same as "attr", but keys and values are passed as
        individual arguments and the original sequence of the attributes is
        kept. The parameters passed will be the same as the @attr
        calculated here:

            @attr = map { $_ => $attr->{$_} } @$attrseq;

        assuming $attr and $attrseq here are the hash and array passed as
        the result of "attr" and "attrseq" argspecs.

        This passes no values for events besides "start".

    "attrseq"
        Attrseq causes a reference to an array of attribute names to be
        passed. This can be useful if you want to walk the "attr" hash in
        the original sequence.

        This passes undef except for "start" events.

        Unless "xml_mode" or "case_sensitive" is enabled, the attribute
        names are forced to lower case.

    "column"
        Column causes the column number of the start of the event to be
        passed. The first column on a line is 0.

    "dtext"
        Dtext causes the decoded text to be passed. General entities are
        automatically decoded unless the event was inside a CDATA section
        or was between literal start and end tags ("script", "style",
        "xmp", "iframe", "title", "textarea" and "plaintext").

        The Unicode character set is assumed for entity decoding. With
        Perl version 5.6 or earlier only the Latin-1 range is supported,
        and entities for characters outside the range 0..255 are left
        unchanged.

        This passes undef except for "text" events.

    "event"
        Event causes the event name to be passed.

        The event name is one of "text", "start", "end", "declaration",
        "comment", "process", "start_document" or "end_document".

    "is_cdata"
        Is_cdata causes a TRUE value to be passed if the event is inside a
        CDATA section or between literal start and end tags ("script",
        "style", "xmp", "iframe", "title", "textarea" and "plaintext").

        If the flag is FALSE for a text event, then you should normally
        either use "dtext" or decode the entities yourself before the text
        is processed further.

    "length"
        Length causes the number of bytes of the source text of the event
        to be passed.

    "line"
        Line causes the line number of the start of the event to be passed.
        The first line in the document is 1. Line counting doesn't start
        until at least one handler requests this value to be reported.

    "offset"
        Offset causes the byte position in the HTML document of the start
        of the event to be passed. The first byte in the document has
        offset 0.

    "offset_end"
        Offset_end causes the byte position in the HTML document of the end
        of the event to be passed. This is the same as "offset" +
        "length".

    "self"
        Self causes the current object to be passed to the handler. If the
        handler is a method, this must be the first element in the argspec.

        An alternative to passing self as an argspec is to register
        closures that capture $self by themselves as handlers.
        Unfortunately this creates circular references which prevent the
        HTML::Parser object from being garbage collected. Using the "self"
        argspec avoids this problem.

    "skipped_text"
        Skipped_text returns the concatenated text of all the events that
        have been skipped since the last time an event was reported.
        Events might be skipped because no handler is registered for them
        or because some filter applies. Skipped text also includes marked
        section markup, since there are no events that can catch it.

        If an ""-handler is registered for an event, then the text for this
        event is not included in "skipped_text". Skipped text both before
        and after the ""-event is included in the next reported
        "skipped_text".

    "tag"
        Same as "tagname", but prefixed with "/" if it belongs to an "end"
        event and "!" for a declaration. The "tag" does not have any
        prefix for "start" events, and is in this case identical to
        "tagname".

    "tagname"
        This is the element name (or generic identifier in SGML jargon) for
        start and end tags. Since HTML is case insensitive, this name is
        forced to lower case to ease string matching.

        Since XML is case sensitive, the tagname case is not changed when
        "xml_mode" is enabled. The same happens if the "case_sensitive"
        attribute is set.

        The declaration type of declaration elements is also passed as a
        tagname, even if that is a bit strange. In fact, in the current
        implementation tagname is identical to "token0" except that the
        name may be forced to lower case.

    "token0"
        Token0 causes the original text of the first token string to be
        passed. This should always be the same as $tokens->[0].

        For "declaration" events, this is the declaration type.

        For "start" and "end" events, this is the tag name.

        For "process" and non-strict "comment" events, this is everything
        inside the tag.

        This passes undef if there are no tokens in the event.

    "tokenpos"
        Tokenpos causes a reference to an array of token positions to be
        passed. For each string that appears in "tokens", this array
        contains two numbers. The first number is the offset of the start
        of the token in the original "text" and the second number is the
        length of the token.

        Boolean attributes in a "start" event will have (0,0) for the
        attribute value offset and length.

        This passes undef if there are no tokens in the event (e.g.,
        "text") and for artificial "end" events triggered by empty element
        tags.

        If you are using these offsets and lengths to modify "text", you
        should either work from right to left, or be very careful to
        calculate the changes to the offsets.

    "tokens"
        Tokens causes a reference to an array of token strings to be
        passed. The strings are exactly as they were found in the original
        text, no decoding or case changes are applied.

        For "declaration" events, the array contains each word, comment,
        and delimited string starting with the declaration type.

        For "comment" events, this contains each sub-comment. If
        $p->strict_comment is disabled, there will be only one
        sub-comment.

        For "start" events, this contains the original tag name followed by
        the attribute name/value pairs. The values of boolean attributes
        will be either the value set by $p->boolean_attribute_value, or the
        attribute name if no value has been set by
        $p->boolean_attribute_value.

        For "end" events, this contains the original tag name (always one
        token).

        For "process" events, this contains the process instructions
        (always one token).

        This passes "undef" for "text" events.

    "text"
        Text causes the source text (including markup element delimiters)
        to be passed.

    "undef"
        Pass an undefined value. Useful as padding where the same handler
        routine is registered for multiple events.

    '...'
        A literal string of 0 to 255 characters enclosed in single (') or
        double (") quotes is passed as entered.
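
    To make the positional argspecs in the list above concrete, the
    following sketch records "event", "offset", "length" and "text" for
    every start, text and end event of a tiny document. The @rows array
    and the sample markup are assumptions made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @rows;
my $p = HTML::Parser->new(api_version => 3);
# The same accumulator and argspec are registered for three events.
$p->handler($_ => \@rows, "event, offset, length, text") for qw(start text end);

$p->parse("<b>hi</b>");
$p->eof;

# Each row holds the values in the order named in the argspec.
printf "%-5s offset=%d length=%d text=%s\n", @$_ for @rows;
```

    For every row, "offset" plus "length" gives the position where the
    next event starts, which is what "offset_end" would report.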

    The whole argspec string can be wrapped up in '@{...}' to signal that
    the resulting event array should be flattened. This only makes a
    difference if an array reference is used as the handler target.
    Consider this example:

        $p->handler(text => [], 'text');
        $p->handler(text => [], '@{text}');

    With two text events; "foo", "bar"; then the first example will end up
    with [["foo"], ["bar"]] and the second with ["foo", "bar"] in the
    handler target array.
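
    The difference can be checked with two parsers fed the same input,
    one using each form. This is a minimal sketch; the array names and
    the "foo<br>bar" input are assumptions made for the example only.

```perl
use strict;
use warnings;
use HTML::Parser ();

my (@nested, @flat);

my $p1 = HTML::Parser->new(api_version => 3);
$p1->handler(text => \@nested, 'text');       # one sub-array per event

my $p2 = HTML::Parser->new(api_version => 3);
$p2->handler(text => \@flat, '@{text}');      # values pushed directly

for my $p ($p1, $p2) {
    $p->parse("foo<br>bar");
    $p->eof;    # flushes the trailing "bar"
}

print "nested: ", join(", ", map { "[$_->[0]]" } @nested), "\n";
print "flat:   ", join(", ", @flat), "\n";
```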

  Events
    Handlers for the following events can be registered:

    "comment"
        This event is triggered when a markup comment is recognized.

        Example:

            <!-- This is a comment -- -- So is this -->

    "declaration"
        This event is triggered when a markup declaration is recognized.

        For typical HTML documents, the only declaration you are likely to
        find is <!DOCTYPE ...>.

        Example:

            <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                "http://www.w3.org/TR/html4/strict.dtd">

        DTDs inside <!DOCTYPE ...> will confuse HTML::Parser.

    "default"
        This event is triggered for events that do not have a specific
        handler. You can set up a handler for this event to catch stuff
        you did not want to catch explicitly.

    "end"
        This event is triggered when an end tag is recognized.

        Example:

            </A>

    "end_document"
        This event is triggered when $p->eof is called and after any
        remaining text is flushed. There is no document text associated
        with this event.

    "process"
        This event is triggered when processing instruction markup is
        recognized.

        The format and content of processing instructions are system and
        application dependent.

        Examples:

            <? HTML processing instructions >
            <? XML processing instructions ?>

    "start"
        This event is triggered when a start tag is recognized.

        Example:

            <A HREF="http://www.perl.com/">

    "start_document"
        This event is triggered before any other events for a new document.
        A handler for it can be used to initialize stuff. There is no
        document text associated with this event.

    "text"
        This event is triggered when plain text (characters) is recognized.
        The text may contain multiple lines. A sequence of text may be
        broken between several text events unless $p->unbroken_text is
        enabled.

        The parser will make sure that it does not break a word or a
        sequence of whitespace between two text events.

  Unicode
    "HTML::Parser" can parse Unicode strings when running under perl-5.8 or
    better. If Unicode is passed to $p->parse() then chunks of Unicode
    will be reported to the handlers. The offset and length argspecs will
    also report their position in terms of characters.

    It is safe to parse raw undecoded UTF-8 if you either avoid decoding
    entities and make sure to not use argspecs that do, or enable the
    "utf8_mode" for the parser. Parsing of undecoded UTF-8 might be useful
    when parsing from a file where you need the reported offsets and
    lengths to match the byte offsets in the file.

    If a filename is passed to $p->parse_file() then the file will be read
    in binary mode. This will be fine if the file contains only ASCII or
    Latin-1 characters. If the file contains UTF-8 encoded text then care
    must be taken when decoding entities as described in the previous
    paragraph, but better is to open the file with the UTF-8 layer so that
    it is decoded properly:

        open(my $fh, "<:utf8", "index.html") || die "...: $!";
        $p->parse_file($fh);

    If the file contains text encoded in a charset besides ASCII, Latin-1
    or UTF-8 then decoding will always be needed.
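
    The effect on reported positions can be sketched by parsing the same
    heart character once as raw UTF-8 octets and once as a decoded
    string. The helper first_start_offset() is a hypothetical name
    introduced for this example; it is not part of "HTML::Parser".

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: offset reported for the first start tag in $doc.
sub first_start_offset {
    my ($doc) = @_;
    my $offset;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { $offset = shift unless defined $offset }, "offset" ],
    );
    $p->parse($doc);
    $p->eof;
    return $offset;
}

my $as_bytes = first_start_offset("\xE2\x99\xA5<b>");    # raw UTF-8 octets
my $as_chars = first_start_offset("\x{2665}<b>");        # decoded string

print "undecoded: <b> starts at offset $as_bytes\n";
print "decoded:   <b> starts at offset $as_chars\n";
```

    Only the "offset" argspec is requested here, so no entity decoding
    takes place and the undecoded input is handled safely.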

VERSION 2 COMPATIBILITY
    When an "HTML::Parser" object is constructed with no arguments, a set
    of handlers is automatically provided that is compatible with the old
    HTML::Parser version 2 callback methods.

    This is equivalent to the following method calls:

        $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
        $p->handler(end     => "end",     "self, tagname, text");
        $p->handler(text    => "text",    "self, text, is_cdata");
        $p->handler(process => "process", "self, token0, text");
        $p->handler(comment =>
                  sub {
                      my($self, $tokens) = @_;
                      for (@$tokens) {$self->comment($_);}},
                  "self, tokens");
        $p->handler(declaration =>
                  sub {
                      my $self = shift;
                      $self->declaration(substr($_[0], 2, -1));},
                  "self, text");

    Setting up these handlers can also be requested with the "api_version
    => 2" constructor option.

SUBCLASSING
    The "HTML::Parser" class is subclassable. Parser objects are plain
    hashes and "HTML::Parser" reserves only hash keys that start with
    "_hparser". The parser state can be set up by invoking the init()
    method, which takes the same arguments as new().
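
    A minimal sketch of a version 2 style subclass follows. The package
    name TagCounter and the "tagcounter_" hash keys are hypothetical
    names chosen for the example; the callback signatures are the ones
    listed in the version 2 compatibility section.

```perl
package TagCounter;    # hypothetical subclass name
use strict;
use warnings;
use HTML::Parser ();
our @ISA = ("HTML::Parser");

# Version 2 callbacks are ordinary methods on the parser object.
sub start {
    my ($self, $tagname, $attr, $attrseq, $text) = @_;
    $self->{tagcounter_seen}{$tagname}++;    # key avoids the "_hparser" prefix
}

sub text {
    my ($self, $text, $is_cdata) = @_;
    $self->{tagcounter_text} .= $text;
}

# The remaining callbacks are defined as no-ops so every event has a target.
sub end         { }
sub comment     { }
sub declaration { }
sub process     { }

package main;

my $p = TagCounter->new;    # no arguments: version 2 handlers are installed
$p->parse("<ul><li>one</li><li>two</li></ul>");
$p->eof;

print "$_: $p->{tagcounter_seen}{$_}\n" for sort keys %{ $p->{tagcounter_seen} };
print "text: $p->{tagcounter_text}\n";
```

    Keeping per-parser state in the object hash works because parser
    objects are plain hashes, as long as the keys stay clear of the
    reserved "_hparser" prefix.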
789
791 The first simple example shows how you might strip out comments from an
792 HTML document. We achieve this by setting up a comment handler that
793 does nothing and a default handler that will print out anything else:
794
795 use HTML::Parser;
796 HTML::Parser->new(default_h => [sub { print shift }, 'text'],
797 comment_h => [""],
798 )->parse_file(shift || die) || die $!;
799
800 An alternative implementation is:
801
802 use HTML::Parser;
803 HTML::Parser->new(end_document_h => [sub { print shift },
804 'skipped_text'],
805 comment_h => [""],
806 )->parse_file(shift || die) || die $!;
807
808 This will in most cases be much more efficient since only a single
809 callback will be made.
810
811 The next example prints out the text that is inside the <title> element
812 of an HTML document. Here we start by setting up a start handler.
813 When it sees the title start tag it enables a text handler that prints
814 any text found and an end handler that will terminate parsing as soon
815 as the title end tag is seen:
816
817 use HTML::Parser ();
818
819 sub start_handler
820 {
821 return if shift ne "title";
822 my $self = shift;
823 $self->handler(text => sub { print shift }, "dtext");
824 $self->handler(end => sub { shift->eof if shift eq "title"; },
825 "tagname,self");
826 }
827
828 my $p = HTML::Parser->new(api_version => 3);
829 $p->handler( start => \&start_handler, "tagname,self");
830 $p->parse_file(shift || die) || die $!;
831 print "\n";
832
833 More examples are found in the eg/ directory of the "HTML-Parser"
834 distribution: the program "hrefsub" shows how you can edit all links
835 found in a document; the program "htextsub" shows how to edit the text
836 only; the program "hstrip" shows how you can strip out certain
837 tags/elements and/or attributes; and the program "htext" shows how to
838 obtain the plain text, but not any script/style content.
839
840 You can browse the eg/ directory online from the [Browse] link on the
841 http://search.cpan.org/~gaas/HTML-Parser/ page.
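
In the spirit of the "htext" program, but not a copy of it, the following sketch collects plain text while skipping script and style content. It assumes that ignore_elements() is an acceptable way to drop those elements; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $plain = '';
my $p = HTML::Parser->new(
    api_version => 3,
    # "dtext" passes the text with entities decoded
    text_h      => [ sub { $plain .= shift }, 'dtext' ],
);
# Skip these elements entirely: start tag, content and end tag
$p->ignore_elements(qw(script style));

$p->parse('<p>Hello</p><script>var x = 1;</script><style>p {}</style> world');
$p->eof;    # flush any pending text
print "$plain\n";
```
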
842
844 The <style> and <script> sections do not end with the first "</", but
845 need the complete corresponding end tag. The standard behaviour is not
846 really practical.
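
A small sketch of this behaviour, with a made-up script body: the "</b" inside the script does not end the element, so all of the content arrives as text. The pieces are concatenated because text may be reported in more than one chunk.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $script_text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $script_text .= shift }, 'text' ],
);
# The "</b" below is script content, not an end tag
$p->parse('<script>if (a</b) run();</script>');
$p->eof;
print "$script_text\n";
```
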
847
848 When the strict_comment option is enabled, we still recognize comments
849 where there is something other than whitespace between even and odd
850 "--" markers.
851
852 Once $p->boolean_attribute_value has been set, there is no way to
853 restore the default behaviour.
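
The sketch below shows the effect of setting the value. The helper checked_value() and the marker string "__TRUE__" are invented for the example. Since the default cannot be restored, one possible workaround is to create a fresh parser object when the default is needed again.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: report the value seen for a bare "checked" attribute
sub checked_value {
    my ($parser) = @_;
    my $value;
    $parser->handler(start => sub { $value = $_[0]{checked} }, 'attr');
    $parser->parse('<input type="checkbox" checked>');
    $parser->eof;
    return $value;
}

# Default: the attribute name doubles as its value
my $default_value = checked_value(HTML::Parser->new(api_version => 3));

# After setting an explicit value, that value is reported instead
my $custom = HTML::Parser->new(api_version => 3);
$custom->boolean_attribute_value('__TRUE__');
my $custom_value = checked_value($custom);

print "$default_value $custom_value\n";
```
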
854
855 There is currently no way to get both quote characters into the same
856 literal argspec.
857
858 Empty tags, e.g. "<>" and "</>", are not recognized. SGML allows them
859 to repeat the previous start tag or close the previous start tag
860 respectively.
861
862 NET tags, e.g. "<code/.../", are not recognized. This is SGML shorthand
863 for "<code>...</code>".
864
865 Unclosed start or end tags, e.g. "<tt<b>...</b</tt>" are not
866 recognized.
867
869 The following messages may be produced by HTML::Parser. The notation
870 in this listing is the same as used in perldiag:
871
872 Not a reference to a hash
873 (F) The object blessed into or subclassed from HTML::Parser is not
874 a hash as required by the HTML::Parser methods.
875
876 Bad signature in parser state object at %p
877 (F) The _hparser_xs_state element does not refer to a valid state
878 structure. Something must have changed the internal value stored
879 in this hash element, or the memory has been overwritten.
880
881 _hparser_xs_state element is not a reference
882 (F) The _hparser_xs_state element has been destroyed.
883
884 Can't find '_hparser_xs_state' element in HTML::Parser hash
885 (F) The _hparser_xs_state element is missing from the parser hash.
886 It was either deleted, or not created when the object was created.
887
888 API version %s not supported by HTML::Parser %s
889 (F) The constructor option 'api_version' with an argument greater
890 than or equal to 4 is reserved for future extensions.
891
892 Bad constructor option '%s'
893 (F) An unknown constructor option key was passed to the new() or
894 init() methods.
895
896 Parse loop not allowed
897 (F) A handler invoked the parse() or parse_file() method. This is
898 not permitted.
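
A minimal sketch of how this error can arise and be trapped with eval. The handler body is contrived on purpose: it re-enters the parser from inside a start event.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(
    api_version => 3,
    # Contrived handler: calls parse() on the parser that invoked it
    start_h => [ sub { my $self = shift; $self->parse('<b>nested</b>') }, 'self' ],
);

my $ok = eval { $p->parse('<p>text</p>'); $p->eof; 1 };
my $error = $ok ? '' : $@;
print "trapped: $error" unless $ok;
```
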
899
900 marked sections not supported
901 (F) The $p->marked_sections() method was invoked in an HTML::Parser
902 module that was compiled without support for marked sections.
903
904 Unknown boolean attribute (%d)
905 (F) Something is wrong with the internal logic that set up aliases
906 for boolean attributes.
907
908 Only code or array references allowed as handler
909 (F) The second argument for $p->handler must be either a subroutine
910 reference, the name of a subroutine or method, or a reference to
911 an array.
912
913 No handler for %s events
914 (F) The first argument to $p->handler must be a valid event name;
915 i.e. one of "start", "end", "text", "process", "declaration" or
916 "comment".
917
918 Unrecognized identifier %s in argspec
919 (F) The identifier is not a known argspec name. Use one of the
920 names mentioned in the argspec section above.
921
922 Literal string is longer than 255 chars in argspec
923 (F) The current implementation limits the length of literals in an
924 argspec to 255 characters. Make the literal shorter.
925
926 Backslash reserved for literal string in argspec
927 (F) The backslash character "\" is not allowed in argspec literals.
928 It is reserved to permit quoting inside a literal in a later
929 version.
930
931 Unterminated literal string in argspec
932 (F) The terminating quote character for a literal was not found.
933
934 Bad argspec (%s)
935 (F) Only identifier names, literals, spaces and commas are allowed
936 in argspecs.
937
938 Missing comma separator in argspec
939 (F) Identifiers in an argspec must be separated with ",".
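
Argspec errors are raised when the handler is registered, so they can be trapped around the $p->handler call. The sketch below tries two deliberately malformed argspecs; the misspelling "tagnam" and the hash %error_for are made up for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);

my %error_for;
for my $argspec ('tagname attr',    # space where a comma is required
                 'tagnam') {        # misspelled identifier
    eval { $p->handler(start => sub { }, $argspec); 1 }
        or $error_for{$argspec} = $@;
}

print "$_ => $error_for{$_}" for sort keys %error_for;
```
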
940
941 Parsing of undecoded UTF-8 will give garbage when decoding entities
942 (W) The first chunk parsed appears to contain undecoded UTF-8 and
943 one or more argspecs that decode entities are used for the callback
944 handlers.
945
946 The result of decoding will be a mix of encoded and decoded
947 characters for any entities that expand to characters with code
948 above 127. This is not a good thing.
949
950 The recommended solution is to apply Encode::decode_utf8() on the
951 data before feeding it to $p->parse(). For $p->parse_file(),
952 pass a file that has been opened in ":utf8" mode.
953
954 The alternative solution is to enable the "utf8_mode" and not
955 decode before passing strings to $p->parse(). The parser can
956 process raw undecoded UTF-8 sanely if the "utf8_mode" is enabled,
957 or if the "attr", "@attr" or "dtext" argspecs are avoided.
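
A minimal sketch of the recommended solution, assuming the input arrives as raw UTF-8 octets (the sample string is made up). The data is decoded with Encode::decode_utf8() before parsing, so the literal character and the decoded entity end up as the same character in the "dtext" output.

```perl
use strict;
use warnings;
use Encode ();
use HTML::Parser ();

# Raw UTF-8 octets, e.g. as read from a socket or a file opened without a layer
my $octets = "<p>caf\xC3\xA9 &eacute;</p>";

my $decoded_text = '';
my $p = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $decoded_text .= shift }, 'dtext' ],
);

# Decode first, then parse
$p->parse(Encode::decode_utf8($octets));
$p->eof;
```
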
958
959 Parsing string decoded with wrong endianness
960 (W) The first character in the document is U+FFFE. This is not a
961 legal Unicode character but a byte swapped BOM. The result of
962 parsing will likely be garbage.
963
964 Parsing of undecoded UTF-32
965 (W) The parser found the Unicode UTF-32 BOM signature at the start
966 of the document. The result of parsing will likely be garbage.
967
968 Parsing of undecoded UTF-16
969 (W) The parser found the Unicode UTF-16 BOM signature at the start
970 of the document. The result of parsing will likely be garbage.
971
973 HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser,
974 HTML::LinkExtor, HTML::Form
975
976 HTML::TreeBuilder (part of the HTML-Tree distribution)
977
978 <http://www.w3.org/TR/html4/>
979
980 More information about marked sections and processing instructions may
981 be found at <http://www.is-thought.co.uk/book/sgml-8.htm>.
982
984 Copyright 1996-2016 Gisle Aas. All rights reserved.
985 Copyright 1999-2000 Michael A. Chase. All rights reserved.
986
987 This library is free software; you can redistribute it and/or modify it
988 under the same terms as Perl itself.
989
990
991
992perl v5.28.1 2016-01-19 Parser(3)