HTML::SimpleParse(3pm)

HTML::SimpleParse(3)  User Contributed Perl Documentation HTML::SimpleParse(3)

NAME

       HTML::SimpleParse - a bare-bones HTML parser

SYNOPSIS

        use HTML::SimpleParse;

        # Parse the text into a simple tree
        my $p = HTML::SimpleParse->new( $html_text );
        $p->output;                 # Output the HTML verbatim

        $p->text( $new_text );      # Give it some new HTML to chew on
        $p->parse;                  # Parse the new HTML
        $p->output;

        my %attrs = HTML::SimpleParse->parse_args('A="xx" B=3');
        # %attrs is now ('A' => 'xx', 'B' => '3')

DESCRIPTION

       This module is a simple HTML parser.  It is similar in concept to
       HTML::Parser, but it differs from HTML::TreeBuilder in a couple of
       important ways.

       First, HTML::TreeBuilder knows which tags can contain other tags, which
       start tags have corresponding end tags, which tags can exist only in
       the <HEAD> portion of the document, and so forth.  HTML::SimpleParse
       does not know any of these things.  It just finds tags and text in the
       HTML you give it; it does not care about the specific content of these
       tags (though it does distinguish between different _types_ of tags, such
       as comments, starting tags like <b>, ending tags like </b>, and so on).

       Second, HTML::SimpleParse does not create a hierarchical tree of HTML
       content, but rather a simple linear list.  It does not pay any
       attention to balancing start tags with corresponding end tags, or which
       pairs of tags are inside other pairs of tags.
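
       The "linear list" idea can be sketched in a few lines of plain Perl.
       The flat_nodes() function below is a hypothetical stand-in written for
       this illustration; it is not the module's parser, and it makes no
       attempt to recognize the "ssi" or "markup" node types.  The field names
       follow the ones documented for the tree method.

```perl
use strict;
use warnings;

# Hypothetical stand-in, for illustration only: this is NOT the module's
# parser. It recognizes comments, start tags, end tags, and text, and makes
# no attempt to recognize "ssi" or "markup" nodes.
sub flat_nodes {
    my ($html) = @_;
    my @nodes;

    # Walk the string left to right; every match becomes one list entry.
    while ( $html =~ m/\G(?:<!--(.*?)-->|<([^>]*)>|([^<]+))/gcs ) {

        # Copy the captures first: the match below would clobber $1..$3.
        my ( $comment, $tag, $text ) = ( $1, $2, $3 );
        my $offset = $-[0];

        my ( $type, $content );
        if ( defined $comment ) {
            ( $type, $content ) = ( 'comment', $comment );
        }
        elsif ( defined $tag ) {
            $type    = $tag =~ m{^/} ? 'endtag' : 'starttag';
            $content = $tag;
        }
        else {
            ( $type, $content ) = ( 'text', $text );
        }

        push @nodes, { type => $type, content => $content, offset => $offset };
    }
    return @nodes;
}

# Nesting is never tracked: <b> and </b> are just two separate entries.
my @nodes = flat_nodes('<p>Hi <b>there</b><!-- note --></p>');
printf "%-8s %2d  %s\n", @{$_}{qw(type offset content)} for @nodes;
```

       Because nothing in the list records which tag encloses which, a filter
       can look at each entry in isolation.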

       Because of these characteristics, you can make a very effective HTML
       filter by subclassing HTML::SimpleParse.  For example, to remove all
       comments from HTML:

        package NoComment;
        use HTML::SimpleParse;
        our @ISA = qw(HTML::SimpleParse);
        sub output_comment {}

        package main;
        NoComment->new($some_html)->output;

       Historically, I started the HTML::SimpleParse project in part because
       of a misunderstanding about HTML::Parser's functionality.  Many aspects
       of these two modules actually overlap.  I continue to maintain the
       HTML::SimpleParse module because people seem to be depending on it, and
       because beginners sometimes find HTML::SimpleParse to be simpler than
       HTML::Parser's more powerful interface.  People also seem to get a fair
       amount of usage out of the "parse_args()" method directly.

   Methods
       •   new

            $p = HTML::SimpleParse->new( $some_html );

           Creates a new HTML::SimpleParse object.  Optionally takes one
           argument, a string containing some HTML with which to initialize
           the object.  If you give it a non-empty string, the HTML will be
           parsed into a tree and ready for outputting.

           Can also take a list of attributes, such as

            $p = HTML::SimpleParse->new( $some_html, 'fix_case' => -1 );

           See the "parse_args()" method below for an explanation of this
           attribute.

       •   text

            $text = $p->text;
            $p->text( $new_text );

           Get or set the contents of the HTML to be parsed.

       •   tree

            foreach ($p->tree) { ... }

           Returns a list of all the nodes in the tree, in case you want to
           step through them manually.  Each node in the tree is an anonymous
           hash with (at least) three data members: $node->{type} (is this a
           comment, a start tag, an end tag, etc.), $node->{content} (all the
           text between the angle brackets, verbatim), and $node->{offset}
           (number of bytes from the beginning of the string).

           The possible values of $node->{type} are "text", "comment",
           "starttag", "endtag", "ssi", and "markup".
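
           As a rough sketch of stepping through such a list, the snippet
           below tallies nodes by type and collects the text content.  The
           three nodes are written by hand for illustration (they describe
           "<b>bold</b>" under the assumption that an end tag's content keeps
           its leading slash); with the module installed, the list would come
           from $p->tree instead.

```perl
use strict;
use warnings;

# Hand-written nodes in the documented shape (illustrative data only).
# With the module installed, the list would come from $p->tree instead.
my @nodes = (
    { type => 'starttag', content => 'b',    offset => 0 },
    { type => 'text',     content => 'bold', offset => 3 },
    { type => 'endtag',   content => '/b',   offset => 7 },
);

# Tally the nodes by type.
my %count;
$count{ $_->{type} }++ for @nodes;

# Collect only the text content, in document order.
my $plain = join '', map { $_->{content} } grep { $_->{type} eq 'text' } @nodes;

print "$_: $count{$_}\n" for sort keys %count;
print "plain text: $plain\n";
```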

       •   parse

            $p->parse;

           Once an object has been initialized with some text, call $p->parse
           and a tree will be created.  After the tree is created, you can
           call $p->output.  If you feed some text to the new() method, parse
           will be called automatically during your object's construction.

       •   parse_args

            %hash = $p->parse_args( $arg_string );

           This routine is handy for parsing the contents of an HTML tag into
           key=value pairs.  For instance:

             $text = 'type=checkbox checked name=flavor value="chocolate or strawberry"';
             %hash = $p->parse_args( $text );
             # %hash is ( TYPE=>'checkbox', CHECKED=>undef, NAME=>'flavor',
             #            VALUE=>'chocolate or strawberry' )

           Note that the position of the last m//g search on the string (the
           value returned by Perl's pos() function) will be altered by the
           parse_args function, so make sure you take that into account if (in
           the above example) you do "$text =~ m/something/g".
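
           The effect can be reproduced without the module.  In the sketch
           below, scan_in_place() is a hypothetical stand-in for any routine
           that runs m//g directly on its argument; it shows two ways to
           protect a search position you care about: save and restore pos(),
           or pass a copy of the string.

```perl
use strict;
use warnings;

# Hypothetical stand-in for any routine that runs m//g on its argument.
# @_ aliases the caller's variables, so pos() changes on the caller's string.
sub scan_in_place {
    my $found = $_[0] =~ m/\w+/g;    # scalar context: one match, pos() is set
    return;
}

my $text = 'type=checkbox checked';

$text =~ m/type=/g;                  # pos($text) is now 5
my $saved = pos($text);

scan_in_place($text);                # moves pos($text) past "checkbox"
pos($text) = $saved;                 # option 1: restore the saved position

scan_in_place("$text");              # option 2: pass a copy, original untouched

print "pos is still ", pos($text), "\n";
```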

           The parse_args() method can be run as either an object method or as
           a class method, i.e. as either $p->parse_args(...) or
           HTML::SimpleParse->parse_args(...).

           HTML attribute lists are supposed to be case-insensitive with
           respect to attribute names.  To achieve this behavior, parse_args()
           respects the 'fix_case' flag, which can be set either as a package
           global $FIX_CASE, or per object as the 'fix_case' attribute.  If
           set to 0, no case conversion is done.  If set to 1, all keys are
           converted to upper case.  If set to -1, all keys are converted to
           lower case.  The default is 1, i.e. all keys are uppercased.

           If an attribute takes no value (like "checked" in the above
           example) then it will still have an entry in the returned hash, but
           its value will be "undef".  For example:

             %hash = $p->parse_args('type=checkbox checked name=banana value=""');
             # $hash{CHECKED} is undef, but $hash{VALUE} is ""
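
           To tell these cases apart in your own code, test with exists()
           before defined().  The sketch below starts from the hash contents
           documented above, written out literally; attr_state() is a
           hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# The hash contents documented above, written out literally
# (default behavior: keys are uppercased).
my %hash = ( TYPE => 'checkbox', CHECKED => undef, NAME => 'banana', VALUE => '' );

# Hypothetical helper: classify one attribute of a parsed attribute hash.
sub attr_state {
    my ( $attrs, $key ) = @_;
    return 'absent'            if !exists $attrs->{$key};
    return 'present, no value' if !defined $attrs->{$key};
    return "present, value '$attrs->{$key}'";
}

print "$_: ", attr_state( \%hash, $_ ), "\n" for qw(CHECKED VALUE SIZE);
```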

           This method actually returns a list (not a hash), so you can
           preserve duplicate attributes and their order by assigning the
           result to an array:

            @pairs = $p->parse_args("name=family value=gwen value=mom value=pop");
            # @pairs is qw(NAME family VALUE gwen VALUE mom VALUE pop)
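
           One way to consume such a flat list is to walk it two elements at
           a time and group the values by key.  This is only a sketch: the
           list is the one shown above, written out literally, and the
           variable names are illustrative.

```perl
use strict;
use warnings;

# The flat list shown above, written out literally; in real use it would be
# the return value of parse_args().
my @pairs = qw(NAME family VALUE gwen VALUE mom VALUE pop);

# Group values by key without losing duplicates or their order.
my ( %values_for, @key_order );
for ( my $i = 0 ; $i < @pairs ; $i += 2 ) {
    my ( $key, $value ) = @pairs[ $i, $i + 1 ];
    push @key_order, $key unless exists $values_for{$key};
    push @{ $values_for{$key} }, $value;
}

print "$_ => @{ $values_for{$_} }\n" for @key_order;
```

           Assigning the same list to a plain hash keeps only the last value
           for each repeated key, which is why the array form matters here.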

       •   output

            $p->output;

           This will output the contents of the HTML, passing the real work
           off to the output_text, output_comment, etc. functions.  If you do
           not override any of these methods, this module will output the
           exact text that it parsed into a tree in the first place.

       •   get_output

            print $p->get_output;

           Similar to $p->output(), but returns its result instead of printing
           it.

       •   execute

            foreach ($p->tree) {
               print $p->execute($_);
            }

           Executes a single node in the HTML parse tree.  Useful if you want
           to loop through the nodes and output them individually.

       The following methods do the actual outputting of the various parts of
       the HTML.  Override some of them if you want to change the way the HTML
       is output.  For instance, to strip comments from the HTML, override the
       output_comment method like so:

        # In subclass:
        sub output_comment { }  # Does nothing

       •   output_text

       •   output_comment

       •   output_endtag

       •   output_starttag

       •   output_markup

       •   output_ssi
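
       The override pattern can be tried out without the module.  The classes
       below are hypothetical stand-ins written for this sketch, not the
       module's source: they dispatch each node to a method named after its
       type, and they assume that each output_* method receives the node's
       content as its first argument and returns a string.

```perl
use strict;
use warnings;

# Hypothetical stand-in classes for illustration; not the module's source.
package Sketch::Base;

sub new {
    my ( $class, @nodes ) = @_;
    return bless { nodes => [@nodes] }, $class;
}

# Build the output by handing every node to the method named after its type.
sub get_output {
    my ($self) = @_;
    return join '', map { $self->execute($_) } @{ $self->{nodes} };
}

sub execute {
    my ( $self, $node ) = @_;
    my $method = "output_$node->{type}";
    return $self->$method( $node->{content} );    # assumed calling convention
}

sub output_text     { return $_[1] }
sub output_starttag { return "<$_[1]>" }
sub output_endtag   { return "<$_[1]>" }
sub output_comment  { return "<!--$_[1]-->" }

# A filter only has to override the handler it cares about.
package Sketch::NoComment;
our @ISA = ('Sketch::Base');

sub output_comment { return '' }

package main;

my @nodes = (
    { type => 'starttag', content => 'b' },
    { type => 'text',     content => 'hi' },
    { type => 'endtag',   content => '/b' },
    { type => 'comment',  content => ' x ' },
);

my $verbatim = Sketch::Base->new(@nodes)->get_output;
my $filtered = Sketch::NoComment->new(@nodes)->get_output;
print "$verbatim\n$filtered\n";
```

       Only the comment handler differs between the two classes, so every
       other kind of node passes through unchanged.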

CAVEATS

       Please do not assume that the interface here is stable.  This is a
       first pass, and I'm still trying to incorporate suggestions from the
       community.  If you employ this module somewhere, make doubly sure
       before upgrading that none of your code breaks when you use the newer
       version.

BUGS

       •   Embedded >s are broken

           The parser does not handle tags with embedded >s in them, like
           <input name=expr value="x > y">.  This will be fixed in a future
           version, probably by using the parse_args method.  Suggestions are
           welcome.

TO DO

       •   extensibility

           Based on a suggestion from Randy Harmon (thanks), I'd like to make
           it easier for subclasses of HTML::SimpleParse to pick out other
           kinds of HTML blocks, i.e. extend the set {text, comment, endtag,
           starttag, markup, ssi} to include more members.  Currently the only
           easy way to do that is by overriding the "parse" method:

            sub parse {  # In subclass
               my $self = shift;           # Remove $self so it is not passed twice
               $self->SUPER::parse(@_);
               foreach ($self->tree) {
                  # Only retype start tags, not text that happens to begin with "a "
                  if ($_->{type} eq 'starttag' and $_->{content} =~ m#^a\s+#i) {
                     $_->{type} = 'anchor_start';
                  }
               }
            }

            sub output_anchor_start {
               # Whatever you want...
            }

           Alternatively, this feature might be implemented by hanging
           attachments onto the parsing loop, like this:

            # Proposed interface only; watch_for() is not implemented
            my $parser = HTML::SimpleParse->new( $html_text );
            my $regex = '<(a\s+.*?)>';
            $parser->watch_for( 'anchor_start', $regex );

            sub HTML::SimpleParse::output_anchor_start {
               # Whatever you want...
            }

           I think I like that idea better.  If you wanted to, you could make
           a subclass with output_anchor_start as one of its methods, and put
           the ->watch_for stuff in the constructor.

       •   reading from filehandles

           It would be nice if you could initialize an object by giving it a
           filehandle or filename instead of the text itself.
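
           Until something like that exists, one workaround is to read the
           file into a string yourself and hand that string to the
           constructor.  The slurp_file() helper and the file name page.html
           below are hypothetical; they are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): read a whole file into a string.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                # undef record separator: read everything at once
    my $html = <$fh>;
    close $fh;
    return $html;
}

# With the module installed, the string can go straight to the constructor:
#   my $p = HTML::SimpleParse->new( slurp_file('page.html') );
```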

       •   tests

           I need to write a few tests that run under "make test".

AUTHOR

       Ken Williams <ken@forum.swarthmore.edu>

COPYRIGHT

       Copyright 1998 Swarthmore College.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.1                      2021-01-27              HTML::SimpleParse(3)