HTML::TokeParser::Simple(3pm)

HTML::TokeParser::Simple(3)  User Contributed Perl Documentation  HTML::TokeParser::Simple(3)

NAME

       HTML::TokeParser::Simple - Easy to use "HTML::TokeParser" interface

SYNOPSIS

        use HTML::TokeParser::Simple;
        my $p = HTML::TokeParser::Simple->new( $somefile );

        while ( my $token = $p->get_token ) {
            # This prints all text in an HTML doc (i.e., it strips the HTML)
            next unless $token->is_text;
            print $token->as_is;
        }

DESCRIPTION

       "HTML::TokeParser" is an excellent module that's often used for parsing
       HTML.  However, the tokens returned are not exactly intuitive to parse:

        ["S",  $tag, $attr, $attrseq, $text]
        ["E",  $tag, $text]
        ["T",  $text, $is_data]
        ["C",  $text]
        ["D",  $text]
        ["PI", $token0, $text]

       To simplify this, "HTML::TokeParser::Simple" allows the user to ask
       more intuitive (read: more self-documenting) questions about the
       tokens returned.

       You can also rebuild some tags on the fly.  Frequently, the attributes
       associated with start tags need to be altered, added to, or deleted.
       This functionality is built in.

       Since this is a subclass of "HTML::TokeParser", all "HTML::TokeParser"
       methods are available.  To truly appreciate the power of this module,
       please read the documentation for "HTML::TokeParser" and
       "HTML::Parser".
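
       To make the contrast above concrete, the following sketch extracts
       link targets twice: once by indexing into the raw token arrays of
       "HTML::TokeParser", and once with the named methods of this module.
       The sample HTML and the variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::TokeParser::Simple;

# Hypothetical sample input, not taken from the module's documentation.
my $html = '<a href="/docs">Docs</a>';

# Plain HTML::TokeParser: the caller must remember the array layout
# ["S", $tag, $attr, $attrseq, $text].
my $raw = HTML::TokeParser->new( \$html );
my @raw_links;
while ( my $token = $raw->get_token ) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'a';
    push @raw_links, $token->[2]{href};
}

# HTML::TokeParser::Simple: the same question asked with named methods.
my $simple = HTML::TokeParser::Simple->new( string => $html );
my @simple_links;
while ( my $token = $simple->get_token ) {
    next unless $token->is_start_tag('a');
    push @simple_links, $token->get_attr('href');
}

print "raw: @raw_links\nsimple: @simple_links\n";
```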

CONSTRUCTORS

   new($source)
       The constructor for "HTML::TokeParser::Simple" can be used just like
       "HTML::TokeParser"'s constructor:

         my $parser = HTML::TokeParser::Simple->new($filename);
         # or
         my $parser = HTML::TokeParser::Simple->new($filehandle);
         # or
         my $parser = HTML::TokeParser::Simple->new(\$html_string);

   new($source_type, $source)
       If you wish to be more explicit, there is a new style of constructor
       available.

         my $parser = HTML::TokeParser::Simple->new(file   => $filename);
         # or
         my $parser = HTML::TokeParser::Simple->new(handle => $filehandle);
         # or
         my $parser = HTML::TokeParser::Simple->new(string => $html_string);

       Note that you do not have to provide a reference for the string if
       using the string constructor.

       As a convenience, you can also attempt to fetch the HTML directly from
       a URL.

         my $parser = HTML::TokeParser::Simple->new(url => 'http://some.url');

       This method relies on "LWP::Simple".  If this module is not found or
       the page cannot be fetched, the constructor will croak().
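
       As a rough check that the constructor styles described above are
       interchangeable, this sketch parses the same document three ways and
       collects the text each parser sees.  The sample HTML, the helper
       "text_of", and the use of an in-memory filehandle are assumptions
       made for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper: concatenate every text token a parser yields.
sub text_of {
    my ($parser) = @_;
    my $text = '';
    while ( my $token = $parser->get_token ) {
        $text .= $token->as_is if $token->is_text;
    }
    return $text;
}

my $html = '<p>Hello, <b>world</b>!</p>';

# Old style: a string must be passed by reference.
my $old_text = text_of( HTML::TokeParser::Simple->new( \$html ) );

# New style: the source type is named and the string is passed as is.
my $new_text = text_of( HTML::TokeParser::Simple->new( string => $html ) );

# New style with a filehandle (here an in-memory handle on the same string).
open my $fh, '<', \$html or die "Cannot open in-memory handle: $!";
my $handle_text = text_of( HTML::TokeParser::Simple->new( handle => $fh ) );
close $fh;

print "$old_text\n$new_text\n$handle_text\n";
```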

PARSER METHODS

   get_token
       This method will return the next token that the
       HTML::TokeParser::get_token() method would return.  However, it will
       be blessed into an appropriate class that represents the token type.

   get_tag
       This method will return the next token that the
       HTML::TokeParser::get_tag() method would return.  However, it will be
       blessed into either the HTML::TokeParser::Simple::Token::Tag::Start or
       HTML::TokeParser::Simple::Token::Tag::End class.

   peek
       As of version 3.14, you can now peek() at the upcoming tokens without
       affecting the state of the parser.  By default, peek() will return the
       text of the next token, but specifying an integer $count will return
       the text of the next $count tokens.

       This is useful when you're trying to debug where you are in a document.

        warn $parser->peek(3); # show the next 3 tokens
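
       The non-destructive behavior of peek() described above can be
       sketched as follows.  The one-line document is a made-up example;
       after both peeks, get_token() should still hand back the first
       token.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input.
my $parser = HTML::TokeParser::Simple->new( string => '<p>Hi</p>' );

my $next_one = $parser->peek;       # text of the next token only
my $next_two = $parser->peek(2);    # text of the next two tokens, joined

# Peeking did not consume anything: this is still the opening <p>.
my $token = $parser->get_token;

print "peek:    $next_one\n";
print "peek(2): $next_two\n";
print "first token is <p>\n" if $token->is_start_tag('p');
```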

ACCESSORS

       The following methods may be called on the token object which is
       returned, not on the parser object.

   Boolean Accessors
       These accessors return true or false.

       •   is_tag([$tag])

           Use this to determine if you have any tag.  An optional "tag type"
           may be passed.  This will allow you to match if it's a particular
           tag.  The supplied tag is case-insensitive.

            if ( $token->is_tag ) { ... }

           Optionally, you may pass a regular expression as an argument.

       •   is_start_tag([$tag])

           Use this to determine if you have a start tag.  An optional "tag
           type" may be passed.  This will allow you to match if it's a
           particular start tag.  The supplied tag is case-insensitive.

            if ( $token->is_start_tag ) { ... }
            if ( $token->is_start_tag( 'font' ) ) { ... }

           Optionally, you may pass a regular expression as an argument.  To
           match all header (h1, h2, ... h6) tags:

            if ( $token->is_start_tag( qr/^h[123456]$/ ) ) { ... }

       •   is_end_tag([$tag])

           Use this to determine if you have an end tag.  An optional "tag
           type" may be passed.  This will allow you to match if it's a
           particular end tag.  The supplied tag is case-insensitive.

           When testing for an end tag, the forward slash on the tag is
           optional.

            while ( $token = $p->get_token ) {
              if ( $token->is_end_tag( 'form' ) ) { ... }
            }

           Or:

            while ( $token = $p->get_token ) {
              if ( $token->is_end_tag( '/form' ) ) { ... }
            }

           Optionally, you may pass a regular expression as an argument.

       •   is_text()

           Use this to determine if you have text.  Note that this is not to
           be confused with the "return_text" (deprecated) method described
           below!  "is_text" will identify text that the user typically sees
           displayed in the Web browser.

       •   is_comment()

           Are you still reading this?  Nobody reads POD.  Don't you know
           you're supposed to go to CLPM, ask a question that's answered in
           the POD and get flamed?  It's a rite of passage.

           Really.

           "is_comment" is used to identify comments.  See the HTML::Parser
           documentation for more information about comments.  There's more
           than you might think.

       •   is_declaration()

           This will match the DTD at the top of your HTML. (You do use DTDs,
           don't you?)

       •   is_process_instruction()

           Process Instructions are from XML.  This is very handy if you need
           to parse out PHP and similar things with a parser.

           Currently, there appear to be some problems with process
           instructions.  You can override
           "HTML::TokeParser::Simple::Token::ProcessInstruction" if you need
           to.

       •   is_pi()

           This is a shorthand for is_process_instruction().

   Data Accessors
       Some of these were originally "return_" methods, but that name was not
       only unwieldy, but also went against reasonable conventions.  The
       "get_" methods listed below still have "return_" methods available for
       backwards compatibility reasons, but they merely call their "get_"
       counterpart.  For example, calling return_tag() actually calls
       get_tag() internally.

       •   get_tag()

           Do you have a start tag or end tag?  This will return the type
           (lower case).  Note that this is not the same as the get_tag()
           method on the actual parser object.

       •   get_attr([$attribute])

           If you have a start tag, this will return a hash ref with the
           attribute names as keys and the values as the values.

           If you pass in an attribute name, it will return the value for just
           that attribute.

           Returns false if the token is not a start tag.

       •   get_attrseq()

           For a start tag, this is an array reference with the sequence of
           the attributes, if any.

           Returns false if the token is not a start tag.

       •   return_text()

           This method has been heavily deprecated (for a couple of years) in
           favor of "as_is".  Programmers were getting confused over the
           difference between "is_text", "return_text", and some parser
           methods such as "HTML::TokeParser::get_text" and friends.

           Using this method still succeeds, but will now carp and will be
           removed in the next major release of this module.

       •   as_is()

           This is the exact text of whatever the token is representing.

       •   get_token0()

           For processing instructions, this will return the token found
           immediately after the opening tag.  Example:  For <?php, "php" will
           be the start of the returned string.

           Note that process instruction handling appears to be incomplete in
           "HTML::TokeParser".

           Returns false if the token is not a process instruction.
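
       The accessors above can be combined in a single pass over a document.
       This sketch classifies each token with the boolean accessors and, for
       a heading start tag, reads its details with the data accessors.  The
       sample document and the names @kinds and %heading are invented for
       the example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample document with mixed-case tag and attribute names.
my $html = '<!DOCTYPE html><!-- note --><H2 Class="title" ID="top">Title</H2>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my ( @kinds, %heading );
while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag( qr/^h[1-6]$/ ) ) {
        push @kinds, 'heading start';
        %heading = (
            tag   => $token->get_tag,               # lower-cased tag name
            class => $token->get_attr('class'),     # one attribute value
            order => [ @{ $token->get_attrseq } ],  # copy of the name sequence
            raw   => $token->as_is,                 # exact original text
        );
    }
    elsif ( $token->is_end_tag('/h2') ) { push @kinds, 'heading end' }
    elsif ( $token->is_declaration )    { push @kinds, 'declaration' }
    elsif ( $token->is_comment )        { push @kinds, 'comment' }
    elsif ( $token->is_text )           { push @kinds, 'text' }
}

print join( ', ', @kinds ), "\n";
print "$heading{tag}: class=$heading{class}, attributes=@{ $heading{order} }\n";
```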

MUTATORS

       The delete_attr() and set_attr() methods allow the programmer to
       rewrite start tag attributes on the fly.  It should be noted that bad
       HTML will be "corrected" by this.  Specifically, the new tag will have
       all attributes lower-cased with the values properly quoted.

       Self-closing tags (e.g. <hr />) are also handled correctly.  Some older
       browsers require a space prior to the final slash in a self-closed tag.
       If such a space is detected in the original HTML, it will be preserved.

       Calling a mutator on a token type that does not support that property
       is a no-op.  For example:

        if ($token->is_comment) {
           $token->set_attr(foo => 'bar'); # does nothing
        }

       •   delete_attr($name)

           This method attempts to delete the attribute specified.  It will
           silently fail if called on anything other than a start tag.  The
           argument is case-insensitive, but must otherwise be an exact match
           of the attribute you are attempting to delete.  If the attribute is
           not found, the method will return without changing the tag.

            # <body bgcolor="#FFFFFF">
            $token->delete_attr('bgcolor');
            print $token->as_is;
            # <body>

           After this method is called, if successful, the as_is(), get_attr()
           and get_attrseq() methods will all return updated results.

       •   set_attr($name,$value)

           This method will set the value of an attribute.  If the attribute
           is not found, then get_attrseq() will have the new attribute listed
           at the end.

            # <p>
            $token->set_attr(class => 'some_class');
            print $token->as_is;
            # <p class="some_class">

            # <body bgcolor="#FFFFFF">
            $token->set_attr('bgcolor','red');
            print $token->as_is;
            # <body bgcolor="red">

           After this method is called, if successful, the as_is(), get_attr()
           and get_attrseq() methods will all return updated results.

       •   set_attr($hashref)

           Under the premise that "set_" methods should accept what their
           corresponding "get_" methods emit, the following works:

             $tag->set_attr($tag->get_attr);

           Theoretically that's a no-op and for purposes of rendering HTML, it
           should be.  However, internally this calls "$tag->rewrite_tag", so
           see that method to understand how this may affect you.

           Of course, this is useless if you want to actually change the
           attributes, so you can do this:

             my $attrs = {
               class  => 'headline',
               valign => 'top'
             };
             $token->set_attr($attrs)
               if $token->is_start_tag('td') &&  $token->get_attr('class') eq 'stories';

       •   rewrite_tag()

           This method rewrites the tag.  The tag name and the name of all
           attributes will be lower-cased.  Values that are not quoted with
           double quotes will be.  This may be called on both start and end
           tags.  Note that both set_attr() and delete_attr() call this method
           prior to returning.

           If called on a token that is not a tag, it simply returns.
           Regardless of how it is called, it returns the token.

            # <body alink=#0000ff BGCOLOR=#ffffff class='none'>
            $token->rewrite_tag;
            print $token->as_is;
            # <body alink="#0000ff" bgcolor="#ffffff" class="none">

           A quick cleanup of sloppy HTML is now the following:

            my $parser = HTML::TokeParser::Simple->new( string => $ugly_html );
            while (my $token = $parser->get_token) {
                $token->rewrite_tag;
                print $token->as_is;
            }
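
       Putting the mutators above together, a filter that edits attributes
       while copying a document might look like the sketch below.  It reuses
       the two tags from the delete_attr() and set_attr() examples; the
       surrounding document and the variable $output are assumptions made
       for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input built from the tags used in the examples above.
my $html = '<body bgcolor="#FFFFFF"><p>Hi</p></body>';

my $parser = HTML::TokeParser::Simple->new( string => $html );

my $output = '';
while ( my $token = $parser->get_token ) {
    # Mutators are no-ops on tokens that are not start tags, but testing
    # the tag name first keeps the intent explicit.
    $token->delete_attr('bgcolor')            if $token->is_start_tag('body');
    $token->set_attr( class => 'some_class' ) if $token->is_start_tag('p');

    # as_is() reflects any change made to the token.
    $output .= $token->as_is;
}

print "$output\n";
```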

PARSER VERSUS TOKENS

       The parser returns tokens that are blessed into appropriate classes.
       Some people get confused and try to call parser methods on tokens and
       token methods on the parser.  To prevent this,
       "HTML::TokeParser::Simple" versions 1.4 and above now bless all tokens
       into appropriate token classes.  Please keep this in mind while using
       this module (and many thanks to PodMaster
       <http://www.perlmonks.org/index.pl?node_id=107642> for pointing out
       this issue to me.)
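
       The two get_tag() methods are an easy place to mix up parser and
       token.  In this sketch, which uses a made-up one-link document, the
       parser's get_tag() returns a token object, while the token's
       get_tag() returns the tag name.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input.
my $parser = HTML::TokeParser::Simple->new( string => '<a href="/a">A</a>' );

# Parser method: returns the next tag as a token object.
my $start = $parser->get_tag;

# Token methods: called on the object the parser returned.
my $name = $start->get_tag;            # the tag name, lower-cased
my $href = $start->get_attr('href');   # an attribute of this start tag

# The next tag in the document is the matching end tag.
my $end = $parser->get_tag;

print ref($start), " '$name' -> $href\n";
print "end tag found\n" if $end->is_end_tag('a');
```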

EXAMPLES

   Finding comments
       For some strange reason, your Pointy-Haired Boss (PHB) is convinced
       that the graphics department is making fun of him by embedding rude
       things about him in HTML comments.  You need to get all HTML comments
       from the HTML.

        use strict;
        use warnings;
        use HTML::TokeParser::Simple;

        my @html_docs = glob( "*.html" );

        # Collect every comment from every document into one report file.
        open my $phb, '>', 'phbreport.txt'
            or die "Cannot open phbreport.txt for writing: $!";

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $p = HTML::TokeParser::Simple->new( file => $doc );
            while ( my $token = $p->get_token ) {
                next unless $token->is_comment;
                print {$phb} $token->as_is, "\n";
            }
        }

        close $phb;

   Stripping Comments
       Uh oh.  Turns out that your PHB was right for a change.  Many of the
       comments in the HTML weren't very polite.  Since your entire graphics
       department was just fired, it falls on you to strip those comments
       from the HTML.

        use strict;
        use warnings;
        use HTML::TokeParser::Simple;

        my $new_folder = 'no_comment/';   # this directory must already exist
        my @html_docs  = glob( "*.html" );

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $new_file = "$new_folder$doc";

            open my $out, '>', $new_file
                or die "Cannot open $new_file for writing: $!";

            my $p = HTML::TokeParser::Simple->new( file => $doc );
            while ( my $token = $p->get_token ) {
                # Copy everything except comments to the new file.
                next if $token->is_comment;
                print {$out} $token->as_is;
            }
            close $out;
        }

   Changing form tags
       Your company was foo.com and now is bar.com.  Unfortunately, whoever
       wrote your HTML decided to hardcode "http://www.foo.com/" into the
       "action" attribute of the form tags.  You need to change it to
       "http://www.bar.com/".

        use strict;
        use warnings;
        use HTML::TokeParser::Simple;

        my $new_folder = 'new_html/';     # this directory must already exist
        my @html_docs  = glob( "*.html" );

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $new_file = "$new_folder$doc";

            open my $out, '>', $new_file
                or die "Cannot open $new_file for writing: $!";

            my $p = HTML::TokeParser::Simple->new( file => $doc );
            while ( my $token = $p->get_token ) {
                if ( $token->is_start_tag('form') ) {
                    my $action = $token->get_attr('action');
                    # Leave forms without an action attribute untouched.
                    if ( defined $action ) {
                        $action =~ s/www\.foo\.com/www.bar.com/;
                        $token->set_attr('action', $action);
                    }
                }
                print {$out} $token->as_is;
            }
            close $out;
        }

CAVEATS

       For compatibility reasons with "HTML::TokeParser", methods that return
       references violate encapsulation, and altering those references
       directly will alter the state of the object.  Subsequent calls to
       rewrite_tag() can thus have unexpected results.  Do not alter these
       references directly unless you are following behavior described in
       these docs.  In the future, certain methods such as "get_attr",
       "get_attrseq" and others may return a copy of the reference rather than
       the original reference.  This behavior has not yet been changed in
       order to maintain compatibility with previous versions of this module.
       At the present time, your author is not aware of anyone taking
       advantage of this "feature," but it's better to be safe than sorry.

       Use of $HTML::Parser::VERSION which is less than 3.25 may result in
       incorrect behavior as older versions do not always handle XHTML
       correctly.  It is the programmer's responsibility to verify that the
       behavior of this code matches the programmer's needs.

       Note that "HTML::Parser" processes text in 512 byte chunks.  This
       sometimes will cause strange behavior and cause text to be broken into
       more than one token.  You can suppress this behavior with the following
       command:

        $p->unbroken_text( [$bool] );

       See the "HTML::Parser" documentation and
       http://www.perlmonks.org/index.pl?node_id=230667 for more information.
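
       Two of the caveats above lend themselves to a short sketch.  The
       first part copies the attribute hash before changing it, then shows
       that writing through the returned reference does change the token.
       The second part turns on unbroken_text() before reading a text run
       longer than one 512 byte chunk.  The sample documents and variable
       names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Caveat 1: get_attr() currently returns the token's own hash reference.
my $parser = HTML::TokeParser::Simple->new( string => '<body bgcolor="#FFFFFF">' );
my $token  = $parser->get_token;

my %copy = %{ $token->get_attr };              # shallow copy: safe to modify
$copy{bgcolor} = 'red';
my $after_copy = $token->get_attr('bgcolor');  # the token is unchanged

$token->get_attr->{bgcolor} = 'blue';          # writes into the token itself
my $after_alias = $token->get_attr('bgcolor');

# Caveat 2: ask HTML::Parser to deliver each text run as a single token.
my $long_html = '<p>' . ( 'x' x 2000 ) . '</p>';
my $unbroken  = HTML::TokeParser::Simple->new( string => $long_html );
$unbroken->unbroken_text(1);                   # set before reading any tokens

my @text_tokens;
while ( my $t = $unbroken->get_token ) {
    push @text_tokens, $t if $t->is_text;
}
my $text_count = @text_tokens;

print "after copy: $after_copy, after direct write: $after_alias\n";
print "text tokens with unbroken_text: $text_count\n";
```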

BUGS

       There are no known bugs, but that's no guarantee.

       Address bug reports and comments to: <eop_divo_sitruc@yahoo.com>.  When
       sending bug reports, please provide the version of "HTML::Parser",
       "HTML::TokeParser", "HTML::TokeParser::Simple", the version of Perl,
       and the version of the operating system you are using.

       Reverse the name to email the author.

SUBCLASSING

       You may wish to change the behavior of this module.  You probably do
       not want to subclass "HTML::TokeParser::Simple".  Instead, you'll want
       to subclass one of the token classes.
       "HTML::TokeParser::Simple::Token" is the base class for all tokens.
       Global behavioral changes should go there.  Otherwise, see the
       appropriate token class for the behavior you wish to alter.

COPYRIGHT

       Copyright (c) 2004 by Curtis "Ovid" Poe.  All rights reserved.  This
       program is free software; you may redistribute it and/or modify it
       under the same terms as Perl itself.

AUTHOR

       Curtis "Ovid" Poe <eop_divo_sitruc@yahoo.com>

       Reverse the name to email the author.

perl v5.36.0                      2023-01-20       HTML::TokeParser::Simple(3)