HTML::Restrict(3) User Contributed Perl Documentation HTML::Restrict(3)

NAME
HTML::Restrict - Strip unwanted HTML tags and attributes

VERSION
version v3.0.0

SYNOPSIS
    use HTML::Restrict;

    my $hr = HTML::Restrict->new();

    # use default rules to start with (strip away all HTML)
    my $processed = $hr->process(' <b>i am bold</b> ');

    # $processed now equals: 'i am bold'

    # Now, a less restrictive example:
    $hr = HTML::Restrict->new(
        rules => {
            b   => [],
            img => [qw( src alt / )]
        }
    );

    my $html = q[<body><b>hello</b> <img src="pic.jpg" alt="me" id="test" /></body>];
    $processed = $hr->process( $html );

    # $processed now equals: <b>hello</b> <img src="pic.jpg" alt="me" />

DESCRIPTION
This module uses HTML::Parser to strip HTML from text in a restrictive
manner. By default all HTML is stripped. You may alter the default
behaviour by supplying your own tag rules.

CONSTRUCTOR AND STARTUP
new()
Creates and returns a new HTML::Restrict object.

    my $hr = HTML::Restrict->new();

HTML::Restrict doesn't require any params to be passed to new(). If your
goal is to remove all HTML from text, then no further setup is
required. Just pass your text to the process() method and you're done:

    my $plain_text = $hr->process( $html );

If you need to set up specific rules, have a look at the params which
HTML::Restrict recognizes:

· "rules => \%rules"

  Sets the rules which will be used to process your data. By default
  all HTML tags are off limits. Use this argument to define the HTML
  elements and corresponding attributes you'd like to allow.
  Essentially, consider the default behaviour to be:

      rules => {}

  Rules should be passed as a HASHREF of allowed tags. Each hash
  value should be an ARRAYREF of the attributes allowed for that tag.
  For example, if you want to allow a fair amount of HTML, you can
  try something like this:

      my %rules = (
          a       => [qw( href target )],
          b       => [],
          caption => [],
          center  => [],
          em      => [],
          i       => [],
          img     => [qw( alt border height width src style )],
          li      => [],
          ol      => [],
          p       => [qw(style)],
          span    => [qw(style)],
          strong  => [],
          sub     => [],
          sup     => [],
          table   => [qw( style border cellspacing cellpadding align )],
          tbody   => [],
          td      => [],
          tr      => [],
          u       => [],
          ul      => [],
      );

      my $hr = HTML::Restrict->new( rules => \%rules );

  Or, to allow only bolded text:

      my $hr = HTML::Restrict->new( rules => { b => [] } );

  Allow bolded text, images and some (but not all) image attributes:

      my %rules = (
          b   => [],
          img => [qw( src alt width height border / )],
      );
      my $hr = HTML::Restrict->new( rules => \%rules );

  Since HTML::Parser treats a closing slash as an attribute, you'll
  need to add "/" to your list of allowed attributes if you'd like
  your tags to retain closing slashes. For example:

      my $hr = HTML::Restrict->new( rules => { hr => [] } );
      $hr->process( "<hr />" );    # returns: <hr>

      $hr = HTML::Restrict->new( rules => { hr => [qw( / )] } );
      $hr->process( "<hr />" );    # returns: <hr />

  HTML::Restrict strips away any tags and attributes which are not
  explicitly allowed. It also rebuilds your explicitly allowed tags
  and places their attributes in the order in which they appear in
  your rules.

  So, if you define the following rules:

      my %rules = (
          ...
          img => [qw( src alt title width height id / )]
          ...
      );

  then your image tags will all be built like this:

      <img src="..." alt="..." title="..." width="..." height="..." id="..." />

  This gives you greater consistency in your tag layout. If you
  don't care about attribute order you don't need to pay any attention
  to this, but you should be aware that your elements are being
  reconstructed rather than just stripped down.

  As of 2.1.0, you can also specify a regex to be tested against the
  attribute value. Place a hash reference in the attribute list which
  maps each attribute name to a pattern; an attribute whose value does
  not match is removed. This feature should be considered experimental
  for the time being:

      my $hr = HTML::Restrict->new(
          rules => {
              iframe => [
                  qw( width height allowfullscreen ),
                  {
                      src         => qr{^http://www\.youtube\.com},
                      frameborder => qr{^(0|1)$},
                  }
              ],
              img => [ qw( alt ), { src => qr{^/my/images/} }, ],
          },
      );

      my $html = '<img src="http://www.example.com/image.jpg" alt="Alt Text">';
      my $processed = $hr->process( $html );

      # $processed now equals: <img alt="Alt Text">

  As of 2.3.0, the value to be tested against can also be a code
  reference. The code reference will be passed the value of the
  attribute, and should return either a string to use for the
  attribute value, or undef to remove the attribute.

      my $hr = HTML::Restrict->new(
          rules => {
              span => [
                  {
                      style => sub {
                          my $value = shift;
                          # all colors are orange
                          $value =~ s/\bcolor\s*:\s*[^;]+/color: orange/g;
                          return $value;
                      }
                  }
              ],
          },
      );

      my $html = '<span style="color: #0000ff;">This is blue</span>';
      my $processed = $hr->process( $html );

      # $processed now equals: <span style="color: orange;">This is blue</span>
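
The two rule behaviours described above (attributes rebuilt in rule order, and a code reference returning undef to drop an attribute) can be sketched together as follows. This is a minimal illustration rather than a recommended policy: the sample markup and the url() check are assumptions made up for the example, not something the module supplies.

```perl
use strict;
use warnings;
use HTML::Restrict;

# Attribute order in the rules (src, alt, /) differs from the input order.
my $reorder = HTML::Restrict->new(
    rules => { img => [qw( src alt / )] },
);
my $rebuilt = $reorder->process('<img alt="me" id="test" src="pic.jpg" />');
print "$rebuilt\n";    # id is not in the rules, so it is dropped

# Hypothetical policy: remove any style attribute which mentions url().
my $filter = HTML::Restrict->new(
    rules => {
        span => [
            {
                style => sub {
                    my $value = shift;
                    return undef if $value =~ /url\s*\(/i;    # drop attribute
                    return $value;                            # keep as-is
                }
            },
        ],
    },
);

my $kept    = $filter->process('<span style="color: red">plain</span>');
my $dropped = $filter->process('<span style="background: url(x.png)">plain</span>');
print "$kept\n$dropped\n";
```

In the second object the span tag itself survives in both cases; only the style attribute is affected by the code reference.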

· "trim => [0|1]"

  By default all leading and trailing spaces will be removed when
  text is processed. Set this value to 0 in order to disable this
  behaviour.
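
A short sketch of the difference, reusing the input from the synopsis with wider padding. The variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::Restrict;

my $input = '  <b>i am bold</b>  ';

# Default behaviour: surrounding whitespace is removed.
my $trimmed = HTML::Restrict->new->process($input);

# With trim disabled, the whitespace around the text is preserved.
my $untrimmed = HTML::Restrict->new( trim => 0 )->process($input);

printf "[%s]\n[%s]\n", $trimmed, $untrimmed;
```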

· "uri_schemes => [undef, 'http', 'https', 'irc', ... ]"

  As of version 1.0.3, URI scheme checking is performed on all href
  and src tag attributes. The following schemes are allowed out of
  the box. No action is required on your part:

      [ undef, 'http', 'https' ]

  (undef represents relative URIs). These restrictions have been put
  in place to prevent XSS in the form of:

      <a href="javascript:alert(document.cookie)">click for cookie!</a>

  See URI for more detailed info on scheme parsing. If, for example,
  you wanted to allow only https URIs, you would do it like this
  (note that this also rejects relative URIs, since undef is not in
  the list):

      uri_schemes => ['https']

  Before 1.0.3 there was no scheme checking at all. You now need to
  whitelist explicitly all URI schemes which are not allowed by
  default. This is in keeping with the whitelisting behaviour of this
  module and is also the safest possible approach. Keep in mind that
  changes to uri_schemes are not additive, so you'll need to include
  the defaults in any changes you make, should you wish to keep them:

      # defaults + irc + mailto
      uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ]
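
To make the non-additive behaviour concrete, here is a small sketch comparing the default scheme list with an extended one. The irc URL and the variable names are made up for the example.

```perl
use strict;
use warnings;
use HTML::Restrict;

my %rules = ( a => [qw( href )] );

# Default schemes: relative URIs (undef), http and https.
my $default = HTML::Restrict->new( rules => \%rules );

# Extended list: the defaults must be repeated, since the list replaces them.
my $extended = HTML::Restrict->new(
    rules       => \%rules,
    uri_schemes => [ undef, 'http', 'https', 'irc' ],
);

my $link = '<a href="irc://irc.example.org/perl">chat</a>';

my $blocked = $default->process($link);     # href removed, link text kept
my $allowed = $extended->process($link);    # href kept

my $xss = $default->process('<a href="javascript:alert(1)">click</a>');

print "$blocked\n$allowed\n$xss\n";
```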

· "allow_declaration => [0|1]"

  Set this value to true if you'd like to allow/preserve DOCTYPE
  declarations in your content. Useful when cleaning up your own
  static files or templates. This feature is off by default.

      my $html = q[<!doctype html><body>foo</body>];

      my $hr = HTML::Restrict->new( allow_declaration => 1 );
      $html = $hr->process( $html );
      # $html is now: "<!doctype html>foo"

· "allow_comments => [0|1]"

  Set this value to true if you'd like to allow/preserve HTML
  comments in your content. Useful when cleaning up your own static
  files or templates. This feature is off by default.

      my $html = q[<body><!-- comments! -->foo</body>];

      my $hr = HTML::Restrict->new( allow_comments => 1 );
      $html = $hr->process( $html );
      # $html is now: "<!-- comments! -->foo"

· "replace_img => [0|1|CodeRef]"

  Set the value to true if you'd like to have img tags replaced with
  "[IMAGE: ...]" containing the alt attribute text. If you set it to
  a code reference, you can provide your own replacement (which may
  even contain HTML).

      sub replacer {
          my ($tagname, $attr, $text) = @_; # from HTML::Parser
          return qq{<a href="$attr->{src}">IMAGE: $attr->{alt}</a>};
      }

      my $hr = HTML::Restrict->new( replace_img => \&replacer );

  This option only takes effect if the img tag is not included in the
  rules.
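
As a hedged sketch of both forms: the first object uses the built-in replacement, the second a custom callback which copes with a missing alt attribute and escapes the text it emits. The "[picture: ...]" wording and the fallback string are assumptions chosen for the example.

```perl
use strict;
use warnings;
use HTML::Restrict;
use HTML::Entities qw( encode_entities );

# Built-in replacement: the img tag becomes a bracketed placeholder.
my $simple = HTML::Restrict->new( replace_img => 1 );
my $text   = $simple->process('<p>Look: <img src="pic.jpg" alt="me"></p>');

# Custom replacement (hypothetical format), tolerant of a missing alt.
my $custom = HTML::Restrict->new(
    replace_img => sub {
        my ( $tagname, $attr, $original ) = @_;
        my $alt = defined $attr->{alt} ? $attr->{alt} : 'no description';
        return '[picture: ' . encode_entities($alt) . ']';
    },
);
my $fallback = $custom->process('<img src="a.png">');

print "$text\n$fallback\n";
```

Escaping the alt text matters because the callback's return value may contain HTML and is used as the replacement.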

· "strip_enclosed_content => [ 'script', 'style' ]"

  The default behaviour up to 1.0.4 was to preserve the content
  between script and style tags, even when the tags themselves were
  being deleted. So, you'd be left with a bunch of JavaScript or
  CSS, just with the enclosing tags missing. This is almost never
  what you want, so since 1.0.5 the default is to remove any script
  or style content which is enclosed in these tags, unless the tags
  have specifically been whitelisted in the rules. This is a sane
  default when cleaning up content submitted via a web form.
  However, if you're using HTML::Restrict to purge your own HTML you
  can be more restrictive.

      # strip the head section, in addition to JS and CSS
      my $html = '<html><head>...</head><body>...<script>JS here</script>foo';

      my $hr = HTML::Restrict->new(
          strip_enclosed_content => [ 'script', 'style', 'head' ]
      );

      $html = $hr->process( $html );
      # $html is now '...foo' (the html and body tags themselves are
      # removed by the default rules)

  The caveat here is that HTML::Restrict will not try to fix broken
  HTML. In the above example, if you have any opening script, style
  or head tags which don't also include matching closing tags, all
  following content will be stripped away, regardless of any parent
  tags.

  Keep in mind that changes to strip_enclosed_content are not
  additive, so if you are adding additional tags you'll need to
  include the entire list of tags whose enclosed content you'd like
  to remove. This feature strips script and style tags by default.
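
The following sketch contrasts the default list with an extended one on the same input. The sample document is invented for the illustration.

```perl
use strict;
use warnings;
use HTML::Restrict;

my $html = '<head><title>Title</title></head>'
    . '<body><script>alert(1)</script>hi</body>';

# Default: script and style content is removed, head content is not.
my $default = HTML::Restrict->new->process($html);

# Extended list: script and style are repeated because the list is
# replaced, not extended.
my $no_head = HTML::Restrict->new(
    strip_enclosed_content => [qw( script style head )],
)->process($html);

print "$default\n$no_head\n";
```

With the default list the title text is still present in the output, since only the tags around it are stripped.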

SUBCLASS METHODS
process( $html )
This is the method which does the real work. It parses your data,
removes any tags and attributes which are not specifically allowed and
returns the resulting text. Requires and returns a SCALAR.

get_rules
Accessor which returns a hash ref of the current rule set.

get_uri_schemes
Accessor which returns an array ref of the current valid URI schemes.
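
The accessors can be used to inspect how an object is configured, for instance when logging. A minimal sketch, assuming a small rule set chosen for the example:

```perl
use strict;
use warnings;
use HTML::Restrict;

my $hr = HTML::Restrict->new(
    rules => { b => [], a => [qw( href )] },
);

# get_rules returns the hash ref of tag => allowed attributes.
my $rules = $hr->get_rules;
for my $tag ( sort keys %{$rules} ) {
    # Skip hash-ref entries (regex or code rules); list plain names only.
    my @attrs = grep { !ref $_ } @{ $rules->{$tag} };
    printf "%-4s %s\n", $tag, @attrs ? join( ', ', @attrs ) : '(no attributes)';
}

# get_uri_schemes returns an array ref; undef stands for relative URIs.
my @schemes = map { defined $_ ? $_ : '(relative)' } @{ $hr->get_uri_schemes };
print "schemes: @schemes\n";
```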

CAVEATS
Please note that all tag and attribute names passed via the rules param
must be supplied in lower case.

    # correct
    my $hr = HTML::Restrict->new( rules => { body => ['onload'] } );

    # throws a fatal error
    my $hr = HTML::Restrict->new( rules => { Body => ['onLoad'] } );
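
If rules come from a source you don't control (a config file, say), one way to cope is to trap the error or to lower-case the names first. The lc_rules helper below is hypothetical and not part of HTML::Restrict; it only handles plain attribute names and leaves hash-ref entries (regex or code rules) untouched.

```perl
use strict;
use warnings;
use HTML::Restrict;

# Hypothetical helper: lower-case tag names and plain attribute names.
sub lc_rules {
    my ($rules) = @_;
    my %lower;
    for my $tag ( keys %{$rules} ) {
        $lower{ lc $tag } = [ map { ref $_ ? $_ : lc $_ } @{ $rules->{$tag} } ];
    }
    return \%lower;
}

my %mixed_case = ( Body => ['onLoad'] );

# The constructor dies on mixed-case names, so trap it with eval.
my $accepted = eval { HTML::Restrict->new( rules => \%mixed_case ); 1 };
warn "rules rejected: $@" if !$accepted;

# After normalising, the same rules are accepted.
my $hr = HTML::Restrict->new( rules => lc_rules( \%mixed_case ) );
print join( ', ', sort keys %{ $hr->get_rules } ), "\n";
```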

MOTIVATION
There are already several modules on the CPAN which accomplish much of
the same thing, but after doing a lot of poking around, I was unable to
find a solution with a simple setup which I was happy with.

The most common use case might be stripping HTML from user submitted
data completely or allowing just a few tags and attributes to be
displayed. With the exception of URI scheme checking, this module
doesn't do any validation on the actual content of the tags or
attributes. If this is a requirement, you can either mess with the
parser object, post-process the text yourself or have a look at one of
the more feature-rich modules in the SEE ALSO section below.

My aim here is to keep things easy and, hopefully, cover a lot of the
less complex use cases with just a few lines of code and some brief
documentation. The idea is to be up and running quickly.

SEE ALSO
HTML::TagFilter, HTML::Defang, MojoMojo::Declaw, HTML::StripScripts,
HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

ACKNOWLEDGEMENTS
Thanks to Raybec Communications <http://www.raybec.com> for funding my
work on this module and for releasing it to the world.

Thanks also to the following for patches, bug reports and assistance:

Mark Jubenville (ioncache)

Duncan Forsyth

Rick Moore

Arthur Axel 'fREW' Schmidt

perlpong

David Golden

Graham TerMarsch

Dagfinn Ilmari Mannsåker

Graham Knop

Carwyn Ellis

AUTHOR
Olaf Alders <olaf@wundercounter.com>

COPYRIGHT AND LICENSE
This software is copyright (c) 2013-2017 by Olaf Alders.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

perl v5.32.0 2020-07-28 HTML::Restrict(3)