HTML::SimpleParse(3) User Contributed Perl Documentation HTML::SimpleParse(3)

NAME
HTML::SimpleParse - a bare-bones HTML parser

SYNOPSIS
    use HTML::SimpleParse;

    # Parse the text into a simple tree
    my $p = new HTML::SimpleParse( $html_text );
    $p->output;                 # Output the HTML verbatim

    $p->text( $new_text );      # Give it some new HTML to chew on
    $p->parse;                  # Parse the new HTML
    $p->output;

    my %attrs = HTML::SimpleParse->parse_args('A="xx" B=3');
    # %attrs is now ('A' => 'xx', 'B' => '3')

DESCRIPTION
This module is a simple HTML parser. It is similar in concept to
HTML::Parser, but it differs from HTML::TreeBuilder in a couple of
important ways.

First, HTML::TreeBuilder knows which tags can contain other tags, which
start tags have corresponding end tags, which tags can exist only in
the <HEAD> portion of the document, and so forth. HTML::SimpleParse
does not know any of these things. It just finds tags and text in the
HTML you give it; it does not care about the specific content of these
tags (though it does distinguish between different _types_ of tags, such
as comments, starting tags like <b>, ending tags like </b>, and so on).

Second, HTML::SimpleParse does not create a hierarchical tree of HTML
content, but rather a simple linear list. It does not pay any
attention to balancing start tags with corresponding end tags, or which
pairs of tags are inside other pairs of tags.
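
To make the flat-list behaviour concrete, the following sketch parses a small nested fragment and prints each node's type and content in order. The sample HTML is invented for illustration; the tree method and the type and content keys are the ones described under Methods.

```perl
use strict;
use warnings;
use HTML::SimpleParse;

# Sample input (made up for this illustration): <b> is nested inside <p>.
my $p = HTML::SimpleParse->new('<p>Hello <b>world</b></p>');

# The nodes come back in document order, with no nesting information.
my @nodes = $p->tree;
for my $node (@nodes) {
    printf "%-8s %s\n", $node->{type}, $node->{content};
}
```

Any nesting has to be reconstructed by the caller, for example by keeping a stack of open tags while walking the list.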

Because of these characteristics, you can make a very effective HTML
filter by sub-classing HTML::SimpleParse. For example, to remove all
comments from HTML:

    package NoComment;
    use HTML::SimpleParse;
    our @ISA = qw(HTML::SimpleParse);
    sub output_comment {}

    package main;
    NoComment->new($some_html)->output;

Historically, I started the HTML::SimpleParse project in part because
of a misunderstanding about HTML::Parser's functionality. Many aspects
of these two modules actually overlap. I continue to maintain the
HTML::SimpleParse module because people seem to be depending on it, and
because beginners sometimes find HTML::SimpleParse to be simpler than
HTML::Parser's more powerful interface. People also seem to get a fair
amount of usage out of the parse_args() method directly.

Methods
• new

      $p = new HTML::SimpleParse( $some_html );

  Creates a new HTML::SimpleParse object. Optionally takes one
  argument, a string containing some HTML with which to initialize
  the object. If you give it a non-empty string, the HTML will be
  parsed into a tree and be ready for output.

  Can also take a list of attributes, such as

      $p = new HTML::SimpleParse( $some_html, 'fix_case' => -1);

  See the parse_args() method below for an explanation of this
  attribute.

• text

      $text = $p->text;
      $p->text( $new_text );

  Get or set the contents of the HTML to be parsed.

• tree

      foreach ($p->tree) { ... }

  Returns a list of all the nodes in the tree, in case you want to
  step through them manually. Each node in the tree is an anonymous
  hash with (at least) three data members: $node->{type} (is this a
  comment, a start tag, an end tag, etc.), $node->{content} (all the
  text between the angle brackets, verbatim), and $node->{offset}
  (number of bytes from the beginning of the string).

  The possible values of $node->{type} are "text", "comment",
  "starttag", "endtag", "ssi", and "markup".

• parse

      $p->parse;

  Once an object has been initialized with some text, call $p->parse
  and a tree will be created. After the tree is created, you can
  call $p->output. If you feed some text to the new() method, parse
  will be called automatically during your object's construction.

• parse_args

      %hash = $p->parse_args( $arg_string );

  This routine is handy for parsing the contents of an HTML tag into
  key=value pairs. For instance:

      $text = 'type=checkbox checked name=flavor value="chocolate or strawberry"';
      %hash = $p->parse_args( $text );
      # %hash is ( TYPE=>'checkbox', CHECKED=>undef, NAME=>'flavor',
      #            VALUE=>'chocolate or strawberry' )

  Note that the position of the last m//g search on the string (the
  value returned by Perl's pos() function) will be altered by the
  parse_args function, so make sure you take that into account if (in
  the above example) you do "$text =~ m/something/g".

  The parse_args() method can be run as either an object method or as
  a class method, i.e. as either $p->parse_args(...) or
  HTML::SimpleParse->parse_args(...).

  HTML attribute lists are supposed to be case-insensitive with
  respect to attribute names. To achieve this behavior, parse_args()
  respects the 'fix_case' flag, which can be set either as a package
  global $FIX_CASE, or as a class member datum 'fix_case'. If set to
  0, no case conversion is done. If set to 1, all keys are converted
  to upper case. If set to -1, all keys are converted to lower case.
  The default is 1, i.e. all keys are uppercased.

  If an attribute takes no value (like "checked" in the above
  example) then it will still have an entry in the returned hash, but
  its value will be undef. For example:

      %hash = $p->parse_args('type=checkbox checked name=banana value=""');
      # $hash{CHECKED} is undef, but $hash{VALUE} is ""

  This method actually returns a list (not a hash), so duplicate
  attributes and order will be preserved if you want them to be:

      @hash = $p->parse_args("name=family value=gwen value=mom value=pop");
      # @hash is qw(NAME family VALUE gwen VALUE mom VALUE pop)

• output

      $p->output;

  This will output the contents of the HTML, passing the real work
  off to the output_text, output_comment, etc. functions. If you do
  not override any of these methods, this module will output the
  exact text that it parsed into a tree in the first place.

• get_output

      print $p->get_output;

  Similar to $p->output(), but returns its result instead of printing
  it.

• execute

      foreach ($p->tree) {
          print $p->execute($_);
      }

  Executes a single node in the HTML parse tree. Useful if you want
  to loop through the nodes and output them individually.
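
As an illustration of the list return value and the fix_case flag described under parse_args, here is a short sketch. The attribute strings and the variable names are made up for the example; only new and parse_args come from the documentation above.

```perl
use strict;
use warnings;
use HTML::SimpleParse;

# The attribute text is kept in a variable because parse_args() alters
# pos() on its argument.
my $attr_text = 'name=family value=gwen value=mom value=pop';

# Class-method call; the result is a flat key/value list, so duplicates survive.
my @pairs = HTML::SimpleParse->parse_args($attr_text);

# Walk the list two items at a time and group the values under each key.
my %values_for;
while (my ($key, $value) = splice(@pairs, 0, 2)) {
    push @{ $values_for{$key} }, $value;
}
print "VALUE: @{ $values_for{VALUE} }\n";

# An object created with fix_case => -1 returns lower-case keys instead.
my $lower      = HTML::SimpleParse->new('', fix_case => -1);
my $input_text = 'TYPE=checkbox CHECKED';
my %attrs      = $lower->parse_args($input_text);
print join(', ', sort keys %attrs), "\n";
```

Assigning the result straight to a hash, as in the second call, keeps only the last value for any repeated key, which is why the first call collects the pairs by hand.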

The following methods do the actual outputting of the various parts of
the HTML. Override some of them if you want to change the way the HTML
is output. For instance, to strip comments from the HTML, override the
output_comment method like so:

    # In subclass:
    sub output_comment { }  # Does nothing

• output_text

• output_comment

• output_endtag

• output_starttag

• output_markup

• output_ssi
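
The same overriding technique extends to several methods at once. The sketch below keeps only the plain text of a document. It assumes, as the NoComment example suggests but the text above does not state outright, that whatever an output_* method returns is what gets emitted for that node. The TextOnly package name and the sample HTML are hypothetical.

```perl
use strict;
use warnings;

package TextOnly;                   # hypothetical subclass name
use parent 'HTML::SimpleParse';

# Emit nothing for every kind of node except plain text.
# Assumption: the return value of each output_* method is what is emitted.
sub output_starttag { '' }
sub output_endtag   { '' }
sub output_comment  { '' }
sub output_markup   { '' }
sub output_ssi      { '' }

package main;

my $html = '<p>Hi <!-- greeting --><b>there</b></p>';
my $text = TextOnly->new($html)->get_output;
print "$text\n";
```

Because output_text is left alone, text nodes pass through unchanged while every other node type contributes an empty string.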

CAVEATS
Please do not assume that the interface here is stable. This is a
first pass, and I'm still trying to incorporate suggestions from the
community. If you employ this module somewhere, make doubly sure
before upgrading that none of your code breaks when you use the newer
version.

BUGS
• Embedded >s are broken

  The parser won't handle tags with embedded >s in them, like
  <input name=expr value="x > y">. This will be fixed in a future
  version, probably by using the parse_args method. Suggestions are
  welcome.
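
A minimal sketch of why embedded >s go wrong, assuming tags are located with a non-greedy pattern along the lines of <(.*?)> (an assumption about the general approach, not a quotation of the module's code). The second pattern is one possible quote-aware alternative shown for comparison; it is not the planned fix.

```perl
use strict;
use warnings;

my $html = '<input name=expr value="x > y">';

# Assumed naive pattern: the tag ends at the first ">" after "<",
# even when that ">" sits inside a quoted attribute value.
my ($naive) = $html =~ /<(.*?)>/s;

# Quote-aware alternative: consume quoted strings whole, so a ">"
# inside quotes cannot end the tag.
my ($aware) = $html =~ /<((?:[^>"']|"[^"]*"|'[^']*')*)>/s;

print "naive: $naive\n";
print "aware: $aware\n";
```

The naive capture stops inside the value attribute, while the quote-aware capture runs to the real end of the tag.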

TO DO
• extensibility

  Based on a suggestion from Randy Harmon (thanks), I'd like to make
  it easier for subclasses of SimpleParse to pick out other kinds of
  HTML blocks, i.e. extend the set {text, comment, endtag, starttag,
  markup, ssi} to include more members. Currently the only easy way
  to do that is by overriding the "parse" method:

      sub parse {  # In subclass
          my $self = shift;
          $self->SUPER::parse(@_);
          foreach ($self->tree) {
              # Only retag start tags, so text beginning with "a " is left alone
              if ($_->{type} eq 'starttag' and $_->{content} =~ m#^a\s+#i) {
                  $_->{type} = 'anchor_start';
              }
          }
      }

      sub output_anchor_start {
          # Whatever you want...
      }

  Alternatively, this feature might be implemented by hanging
  attachments onto the parsing loop, like this:

      # Proposed interface only; watch_for() does not exist yet
      my $parser = new HTML::SimpleParse( $html_text );
      $regex = '<(a\s+.*?)>';
      $parser->watch_for( 'anchor_start', $regex );

      sub HTML::SimpleParse::output_anchor_start {
          # Whatever you want...
      }

  I think I like that idea better. If you wanted to, you could make
  a subclass with output_anchor_start as one of its methods, and put
  the ->watch_for stuff in the constructor.

• reading from filehandles

  It would be nice if you could initialize an object by giving it a
  filehandle or filename instead of the text itself.

• tests

  I need to write a few tests that run under "make test".
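
Until reading from filehandles is supported directly, the text can be read by the caller first. The helper below is a hypothetical sketch, not part of HTML::SimpleParse: the name slurp_html and its filename-or-handle convention are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::SimpleParse): return the full
# contents of a filename or an already-open filehandle as one string.
sub slurp_html {
    my ($source) = @_;
    my $fh;
    if (ref $source) {
        $fh = $source;              # assume a filehandle reference
    } else {
        open $fh, '<', $source or die "Cannot open $source: $!";
    }
    local $/;                       # slurp mode: read everything at once
    my $text = <$fh>;
    return defined $text ? $text : '';
}

# Demonstrate with an in-memory filehandle so that no file is needed.
my $sample = '<b>hi</b>';
open my $mem, '<', \$sample or die "Cannot open in-memory handle: $!";
my $html = slurp_html($mem);
print "$html\n";
```

The resulting string can then be handed to the new() or text() method in the usual way.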

AUTHOR
Ken Williams <ken@forum.swarthmore.edu>

COPYRIGHT
Copyright 1998 Swarthmore College. All rights reserved.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.36.0 2023-01-20 HTML::SimpleParse(3)