HTML::TokeParser::Simple(3pm)

HTML::TokeParser::Simple(3)  User Contributed Perl Documentation  HTML::TokeParser::Simple(3)

NAME

       HTML::TokeParser::Simple - Easy to use "HTML::TokeParser" interface

SYNOPSIS

        use HTML::TokeParser::Simple;
        my $p = HTML::TokeParser::Simple->new( $somefile );

        while ( my $token = $p->get_token ) {
            # This prints all text in an HTML doc (i.e., it strips the HTML)
            next unless $token->is_text;
            print $token->as_is;
        }

DESCRIPTION

       "HTML::TokeParser" is an excellent module that's often used for parsing
       HTML.  However, the tokens returned are not exactly intuitive to parse:

        ["S",  $tag, $attr, $attrseq, $text]
        ["E",  $tag, $text]
        ["T",  $text, $is_data]
        ["C",  $text]
        ["D",  $text]
        ["PI", $token0, $text]

       To simplify this, "HTML::TokeParser::Simple" allows the user to ask
       more intuitive (read: more self-documenting) questions about the
       tokens returned.

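       As a rough illustration of the difference, the sketch below collects
       link targets twice: once by indexing into the raw
       "HTML::TokeParser" arrays, and once by asking the token.  The sample
       HTML and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $html = '<p>See <a href="/docs">the docs</a>.</p>';

# Raw HTML::TokeParser: inspect the array slots by position.
my @raw_links;
my $raw = HTML::TokeParser->new( \$html );
while ( my $token = $raw->get_token ) {
    next unless $token->[0] eq 'S' && $token->[1] eq 'a';
    push @raw_links, $token->[2]{href};
}

# HTML::TokeParser::Simple: ask the token the same question by name.
my @simple_links;
my $simple = HTML::TokeParser::Simple->new( string => $html );
while ( my $token = $simple->get_token ) {
    next unless $token->is_start_tag('a');
    push @simple_links, $token->get_attr('href');
}
```

       Both loops should find the same link; the second one simply reads
       closer to the question being asked.
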
       You can also rebuild some tags on the fly.  Frequently, the attributes
       associated with start tags need to be altered, added to, or deleted.
       This functionality is built in.

       Since this is a subclass of "HTML::TokeParser", all "HTML::TokeParser"
       methods are available.  To truly appreciate the power of this module,
       please read the documentation for "HTML::TokeParser" and
       "HTML::Parser".

CONSTRUCTORS

   "new($source)"
       The constructor for "HTML::TokeParser::Simple" can be used just like
       "HTML::TokeParser"'s constructor:

         my $parser = HTML::TokeParser::Simple->new($filename);
         # or
         my $parser = HTML::TokeParser::Simple->new($filehandle);
         # or
         my $parser = HTML::TokeParser::Simple->new(\$html_string);

   "new($source_type, $source)"
       If you wish to be more explicit, there is a new style of constructor
       available.

         my $parser = HTML::TokeParser::Simple->new(file   => $filename);
         # or
         my $parser = HTML::TokeParser::Simple->new(handle => $filehandle);
         # or
         my $parser = HTML::TokeParser::Simple->new(string => $html_string);

       Note that you do not have to provide a reference for the string if
       using the string constructor.

       As a convenience, you can also attempt to fetch the HTML directly from
       a URL.

         my $parser = HTML::TokeParser::Simple->new(url => 'http://some.url');

       This method relies on "LWP::Simple".  If this module is not found or
       the page cannot be fetched, the constructor will "croak()".

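       The two string forms can be compared side by side.  This is only a
       sketch: the sample HTML and the rebuild() helper are made up for
       illustration and are not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $html = '<p>Hello, <b>world</b>!</p>';

# Old style: a string must be passed as a reference.
my $by_ref = HTML::TokeParser::Simple->new( \$html );

# New style: the source type is named and the string is passed directly.
my $by_name = HTML::TokeParser::Simple->new( string => $html );

# Hypothetical helper: re-assemble a document from its tokens.
sub rebuild {
    my ($parser) = @_;
    my $out = '';
    while ( my $token = $parser->get_token ) {
        $out .= $token->as_is;
    }
    return $out;
}

my $from_ref  = rebuild($by_ref);
my $from_name = rebuild($by_name);
```

       Either way the parser sees the same document, so both results should
       match the original string.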

PARSER METHODS

   get_token
       This method will return the next token that the
       "HTML::TokeParser::get_token()" method would return.  However, it will
       be blessed into a class appropriate to the token type.

   get_tag
       This method will return the next token that the
       "HTML::TokeParser::get_tag()" method would return.  However, it will be
       blessed into either the HTML::TokeParser::Simple::Token::Tag::Start or
       HTML::TokeParser::Simple::Token::Tag::End class.

   peek
       As of version 3.14, you can "peek()" at the upcoming tokens without
       affecting the state of the parser.  By default, "peek()" will
       return the text of the next token, but specifying an integer $count
       will return the text of the next $count tokens.

       This is useful when you're trying to debug where you are in a document.

        warn $parser->peek(3); # show the next 3 tokens

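       A small sketch of the "does not affect the parser" claim, using a
       made-up snippet of HTML: after peeking, the next get_token() call
       should still hand back the first token.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $parser = HTML::TokeParser::Simple->new(
    string => '<h1>Title</h1><p>Body</p>' );

my $preview = $parser->peek(2);    # text of the next two tokens, as one string
my $first   = $parser->get_token;  # peek() did not consume anything
```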

ACCESSORS

       The following methods may be called on the token object which is
       returned, not on the parser object.

   Boolean Accessors
       These accessors return true or false.

       •   "is_tag([$tag])"

           Use this to determine if you have any tag.  An optional "tag type"
           may be passed.  This will allow you to match if it's a particular
           tag.  The supplied tag is case-insensitive.

            if ( $token->is_tag ) { ... }

           Optionally, you may pass a regular expression as an argument.

       •   "is_start_tag([$tag])"

           Use this to determine if you have a start tag.  An optional "tag
           type" may be passed.  This will allow you to match if it's a
           particular start tag.  The supplied tag is case-insensitive.

            if ( $token->is_start_tag ) { ... }
            if ( $token->is_start_tag( 'font' ) ) { ... }

           Optionally, you may pass a regular expression as an argument.  To
           match all header (h1, h2, ... h6) tags:

            if ( $token->is_start_tag( qr/^h[123456]$/ ) ) { ... }

       •   "is_end_tag([$tag])"

           Use this to determine if you have an end tag.  An optional "tag
           type" may be passed.  This will allow you to match if it's a
           particular end tag.  The supplied tag is case-insensitive.

           When testing for an end tag, the forward slash on the tag is
           optional.

            while ( $token = $p->get_token ) {
              if ( $token->is_end_tag( 'form' ) ) { ... }
            }

           Or:

            while ( $token = $p->get_token ) {
              if ( $token->is_end_tag( '/form' ) ) { ... }
            }

           Optionally, you may pass a regular expression as an argument.

       •   "is_text()"

           Use this to determine if you have text.  Note that this is not to
           be confused with the "return_text" (deprecated) method described
           below!  "is_text" will identify text that the user typically sees
           displayed in the Web browser.

       •   "is_comment()"

           Are you still reading this?  Nobody reads POD.  Don't you know
           you're supposed to go to CLPM, ask a question that's answered in
           the POD and get flamed?  It's a rite of passage.

           Really.

           "is_comment" is used to identify comments.  See the HTML::Parser
           documentation for more information about comments.  There's more
           than you might think.

       •   "is_declaration()"

           This will match the DTD at the top of your HTML. (You do use DTD's,
           don't you?)

       •   "is_process_instruction()"

           Process Instructions are from XML.  This is very handy if you need
           to parse out PHP and similar things with a parser.

           Currently, there appear to be some problems with process
           instructions.  You can override
           "HTML::TokeParser::Simple::Token::ProcessInstruction" if you need
           to.

       •   "is_pi()"

           This is a shorthand for "is_process_instruction()".

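       Taken together, the boolean accessors make it easy to sort tokens by
       kind.  The sketch below tallies a few token types in a made-up
       document; the %count keys are arbitrary names chosen for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $html   = '<!DOCTYPE html><!-- draft --><h1>Title</h1><p>Body</p>';
my $parser = HTML::TokeParser::Simple->new( string => $html );

my %count = map { $_ => 0 } qw(declaration comment heading end_h1 text_chars);
while ( my $token = $parser->get_token ) {
    if    ( $token->is_declaration )                  { $count{declaration}++ }
    elsif ( $token->is_comment )                      { $count{comment}++ }
    elsif ( $token->is_start_tag( qr/^h[123456]$/ ) ) { $count{heading}++ }
    elsif ( $token->is_end_tag('h1') )                { $count{end_h1}++ }
    elsif ( $token->is_text ) {
        # Count characters rather than tokens, since text may be split.
        $count{text_chars} += length $token->as_is;
    }
}
```
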
   Data Accessors
       Some of these were originally "return_" methods, but that name was not
       only unwieldy, but also went against reasonable conventions.  The
       "get_" methods listed below still have "return_" methods available for
       backwards compatibility reasons, but they merely call their "get_"
       counterpart.  For example, calling "return_tag()" actually calls
       "get_tag()" internally.

       •   "get_tag()"

           Do you have a start tag or end tag?  This will return the type
           (lower case).  Note that this is not the same as the "get_tag()"
           method on the actual parser object.

       •   "get_attr([$attribute])"

           If you have a start tag, this will return a hash ref with the
           attribute names as keys and the values as the values.

           If you pass in an attribute name, it will return the value for just
           that attribute.

           Returns false if the token is not a start tag.

       •   "get_attrseq()"

           For a start tag, this is an array reference with the sequence of
           the attributes, if any.

           Returns false if the token is not a start tag.

       •   "return_text()"

           This method has been heavily deprecated (for a couple of years) in
           favor of "as_is".  Programmers were getting confused over the
           difference between "is_text", "return_text", and some parser
           methods such as "HTML::TokeParser::get_text" and friends.

           Using this method still succeeds, but will now carp and will be
           removed in the next major release of this module.

       •   "as_is()"

           This is the exact text of whatever the token is representing.

       •   "get_token0()"

           For processing instructions, this will return the token found
           immediately after the opening tag.  Example:  For <?php, "php" will
           be the start of the returned string.

           Note that process instruction handling appears to be incomplete in
           "HTML::TokeParser".

           Returns false if the token is not a process instruction.

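       A short sketch of the data accessors applied to a single start tag.
       The sample tag and the variable names are assumptions made for the
       example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $parser = HTML::TokeParser::Simple->new(
    string => '<a href="/docs" title="Docs">read</a>' );

my $token = $parser->get_token;          # the <a ...> start tag
my $name  = $token->get_tag;             # tag name, lower-cased
my $href  = $token->get_attr('href');    # a single attribute value
my $attrs = $token->get_attr;            # hash ref of all attributes
my @order = @{ $token->get_attrseq };    # attribute names in source order
my $raw   = $token->as_is;               # the tag exactly as written
```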

MUTATORS

       The "delete_attr()" and "set_attr()" methods allow the programmer to
       rewrite start tag attributes on the fly.  It should be noted that bad
       HTML will be "corrected" by this.  Specifically, the new tag will have
       all attributes lower-cased with the values properly quoted.

       Self-closing tags (e.g. <hr />) are also handled correctly.  Some older
       browsers require a space prior to the final slash in a self-closed tag.
       If such a space is detected in the original HTML, it will be preserved.

       Calling a mutator on a token type that does not support that property
       is a no-op.  For example:

        if ($token->is_comment) {
           $token->set_attr(foo => 'bar'); # does nothing
        }

       •   "delete_attr($name)"

           This method attempts to delete the attribute specified.  It will
           silently fail if called on anything other than a start tag.  The
           argument is case-insensitive, but must otherwise be an exact match
           of the attribute you are attempting to delete.  If the attribute is
           not found, the method will return without changing the tag.

            # <body bgcolor="#FFFFFF">
            $token->delete_attr('bgcolor');
            print $token->as_is;
            # <body>

           After this method is called, if successful, the "as_is()",
           "get_attr()" and "get_attrseq()" methods will all return updated
           results.

       •   "set_attr($name,$value)"

           This method will set the value of an attribute.  If the attribute
           is not found, then "get_attrseq()" will have the new attribute
           listed at the end.

            # <p>
            $token->set_attr(class => 'some_class');
            print $token->as_is;
            # <p class="some_class">

            # <body bgcolor="#FFFFFF">
            $token->set_attr('bgcolor','red');
            print $token->as_is;
            # <body bgcolor="red">

           After this method is called, if successful, the "as_is()",
           "get_attr()" and "get_attrseq()" methods will all return updated
           results.

       •   "set_attr($hashref)"

           Under the premise that "set_" methods should accept what their
           corresponding "get_" methods emit, the following works:

             $tag->set_attr($tag->get_attr);

           Theoretically that's a no-op and for purposes of rendering HTML, it
           should be.  However, internally this calls "$tag->rewrite_tag", so
           see that method to understand how this may affect you.

           Of course, this is useless if you want to actually change the
           attributes, so you can do this:

             my $attrs = {
               class  => 'headline',
               valign => 'top'
             };
             $token->set_attr($attrs)
               if $token->is_start_tag('td') && $token->get_attr('class') eq 'stories';

       •   "rewrite_tag()"

           This method rewrites the tag.  The tag name and the name of all
           attributes will be lower-cased.  Values that are not quoted with
           double quotes will be.  This may be called on either start or end
           tags.  Note that both "set_attr()" and "delete_attr()" call this
           method prior to returning.

           If called on a token that is not a tag, it simply returns.
           Regardless of how it is called, it returns the token.

            # <body alink=#0000ff BGCOLOR=#ffffff class='none'>
            $token->rewrite_tag;
            print $token->as_is;
            # <body alink="#0000ff" bgcolor="#ffffff" class="none">

           A quick cleanup of sloppy HTML is now the following:

            my $parser = HTML::TokeParser::Simple->new( string => $ugly_html );
            while (my $token = $parser->get_token) {
                $token->rewrite_tag;
                print $token->as_is;
            }

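       The mutators combine naturally in a filtering loop.  The sketch below
       drops one attribute and adds another on every link while passing all
       other tokens through untouched.  The sample HTML and the choice of
       attributes are assumptions made for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $html   = '<p><a href="/x" target="_blank">link</a></p>';
my $parser = HTML::TokeParser::Simple->new( string => $html );

my $out = '';
while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('a') ) {
        $token->delete_attr('target');            # drop an attribute
        $token->set_attr( rel => 'nofollow' );    # add (or overwrite) one
    }
    $out .= $token->as_is;    # as_is() reflects any rewrite
}
```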

PARSER VERSUS TOKENS

       The parser returns tokens that are blessed into appropriate classes.
       Some people get confused and try to call parser methods on tokens and
       token methods on the parser.  To prevent this,
       "HTML::TokeParser::Simple" versions 1.4 and above now bless all tokens
       into appropriate token classes.  Please keep this in mind while using
       this module (and many thanks to PodMaster
       <http://www.perlmonks.org/index.pl?node_id=107642> for pointing out
       this issue to me.)

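       The name "get_tag" exists on both kinds of object, which makes it a
       handy way to see the split.  In this sketch (sample HTML made up for
       the example) the parser's get_tag() returns a token, and the token's
       get_tag() returns the tag name.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $parser = HTML::TokeParser::Simple->new(
    string => '<a href="/docs">the docs</a>' );

# Parser method: fetch the next tag and return it as a token object.
my $token = $parser->get_tag;

# Token methods: ask the token about itself.
my $name = $token->get_tag;             # the tag's name
my $href = $token->get_attr('href');    # an attribute value
```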

EXAMPLES

   Finding comments
       For some strange reason, your Pointy-Haired Boss (PHB) is convinced
       that the graphics department is making fun of him by embedding rude
       things about him in HTML comments.  You need to get all HTML comments
       from the HTML.

        use strict;
        use HTML::TokeParser::Simple;

        my @html_docs = glob( "*.html" );

        open my $report, '>', 'phbreport.txt'
            or die "Cannot open phbreport.txt for writing: $!";

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $p = HTML::TokeParser::Simple->new( file => $doc );
            while ( my $token = $p->get_token ) {
                next unless $token->is_comment;
                print {$report} $token->as_is, "\n";
            }
        }

        close $report;

   Stripping Comments
       Uh oh.  Turns out that your PHB was right for a change.  Many of the
       comments in the HTML weren't very polite.  Since your entire graphics
       department was just fired, it falls on you to strip those comments
       from the HTML.

        use strict;
        use HTML::TokeParser::Simple;

        my $new_folder = 'no_comment/';
        my @html_docs  = glob( "*.html" );

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $new_file = "$new_folder$doc";

            open my $out, '>', $new_file
                or die "Cannot open $new_file for writing: $!";

            my $p = HTML::TokeParser::Simple->new( file => $doc );
            while ( my $token = $p->get_token ) {
                next if $token->is_comment;
                print {$out} $token->as_is;
            }
            close $out;
        }

   Changing form tags
       Your company was foo.com and now is bar.com.  Unfortunately, whoever
       wrote your HTML decided to hardcode "http://www.foo.com/" into the
       "action" attribute of the form tags.  You need to change it to
       "http://www.bar.com/".

        use strict;
        use HTML::TokeParser::Simple;

        my $new_folder = 'new_html/';
        my @html_docs  = glob( "*.html" );

        foreach my $doc ( @html_docs ) {
            print "Processing $doc\n";
            my $new_file = "$new_folder$doc";

            open my $out, '>', $new_file
                or die "Cannot open $new_file for writing: $!";

            my $p = HTML::TokeParser::Simple->new( file => $doc );
            while ( my $token = $p->get_token ) {
                if ( $token->is_start_tag('form') ) {
                    my $action = $token->get_attr('action');
                    # Leave forms without an action attribute untouched.
                    if ( defined $action ) {
                        $action =~ s/www\.foo\.com/www.bar.com/;
                        $token->set_attr('action', $action);
                    }
                }
                print {$out} $token->as_is;
            }
            close $out;
        }

CAVEATS

       For compatibility reasons with "HTML::TokeParser", methods that return
       references violate encapsulation: altering the references directly
       will alter the state of the object.  Subsequent calls to
       "rewrite_tag()" can thus have unexpected results.  Do not alter these
       references directly unless you are following behavior described in
       these docs.  In the future, certain methods such as "get_attr",
       "get_attrseq" and others may return a copy of the reference rather than
       the original reference.  This behavior has not yet been changed in
       order to maintain compatibility with previous versions of this module.
       At the present time, your author is not aware of anyone taking
       advantage of this "feature," but it's better to be safe than sorry.

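       One cautious way to stay clear of this, sketched here with made-up
       sample data, is to work on shallow copies of the returned references
       and to route real changes through the documented mutators.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch).
my $parser = HTML::TokeParser::Simple->new(
    string => '<td class="stories" valign="top">' );
my $token = $parser->get_token;

# Shallow copies: changes to these do not touch the token's own data.
my %attr    = %{ $token->get_attr };
my @attrseq = @{ $token->get_attrseq };

$attr{class} = 'scratch';    # only the copy changes
push @attrseq, 'id';         # likewise

# To change the token itself, go through the mutator instead.
$token->set_attr( class => 'headline' );
```
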
       Using a version of "HTML::Parser" earlier than 3.25 (see
       $HTML::Parser::VERSION) may result in incorrect behavior, as older
       versions do not always handle XHTML correctly.  It is the programmer's
       responsibility to verify that the behavior of this code matches the
       programmer's needs.

       Note that "HTML::Parser" processes text in 512 byte chunks.  This
       sometimes will cause strange behavior and cause text to be broken into
       more than one token.  You can suppress this behavior with the following
       command:

        $p->unbroken_text( [$bool] );

       See the "HTML::Parser" documentation and
       http://www.perlmonks.org/index.pl?node_id=230667 for more information.

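       A minimal sketch of turning the option on, assuming it is set before
       any tokens are read.  The long run of text is made up so that it spans
       more than one read chunk.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input (an assumption for this sketch): one long text run.
my $long_text = 'x' x 10_000;
my $parser    = HTML::TokeParser::Simple->new( string => "<p>$long_text</p>" );

# Ask HTML::Parser to deliver each run of text as a single token.
$parser->unbroken_text(1);

my @text_tokens;
while ( my $token = $parser->get_token ) {
    push @text_tokens, $token if $token->is_text;
}
```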

BUGS

       There are no known bugs, but that's no guarantee.

       Address bug reports and comments to: <eop_divo_sitruc@yahoo.com>.  When
       sending bug reports, please provide the version of "HTML::Parser",
       "HTML::TokeParser", "HTML::TokeParser::Simple", the version of Perl,
       and the version of the operating system you are using.

       Reverse the name to email the author.

SUBCLASSING

       You may wish to change the behavior of this module.  You probably do
       not want to subclass "HTML::TokeParser::Simple".  Instead, you'll want
       to subclass one of the token classes.
       "HTML::TokeParser::Simple::Token" is the base class for all tokens.
       Global behavioral changes should go there.  Otherwise, see the
       appropriate token class for the behavior you wish to alter.

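       One possible reading of "global behavioral changes should go there" is
       to add a method to the base token class so that every token gains it.
       The sketch below does that by reopening the package.  This is an
       assumption about usage, not something the module prescribes, and
       is_heading() is a made-up method name.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Assumption: a method added to the base token class is inherited by every
# token class. is_heading() is hypothetical, not part of the module.
{
    package HTML::TokeParser::Simple::Token;

    sub is_heading {
        my ($self) = @_;
        return $self->is_start_tag( qr/^h[123456]$/ ) ? 1 : 0;
    }
}

# Sample input (an assumption for this sketch).
my $parser = HTML::TokeParser::Simple->new(
    string => '<h2>News</h2><p>Text</p>' );

my @headings;
while ( my $token = $parser->get_token ) {
    push @headings, $token->get_tag if $token->is_heading;
}
```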

COPYRIGHT

       Copyright (c) 2004 by Curtis "Ovid" Poe.  All rights reserved.  This
       program is free software; you may redistribute it and/or modify it
       under the same terms as Perl itself.

AUTHOR

       Curtis "Ovid" Poe <eop_divo_sitruc@yahoo.com>

       Reverse the name to email the author.

perl v5.34.0                      2022-01-21       HTML::TokeParser::Simple(3)