HTML::Restrict(3pm)

HTML::Restrict(3)     User Contributed Perl Documentation    HTML::Restrict(3)

NAME

       HTML::Restrict - Strip unwanted HTML tags and attributes

VERSION

       version v3.0.1

SYNOPSIS

           use HTML::Restrict;

           my $hr = HTML::Restrict->new();

           # use default rules to start with (strip away all HTML)
           my $processed = $hr->process('  <b>i am bold</b>  ');

           # $processed now equals: 'i am bold'

           # Now, a less restrictive example:
           $hr = HTML::Restrict->new(
               rules => {
                   b   => [],
                   img => [qw( src alt / )]
               }
           );

           my $html = q[<body><b>hello</b> <img src="pic.jpg" alt="me" id="test" /></body>];
           $processed = $hr->process( $html );

           # $processed now equals: <b>hello</b> <img src="pic.jpg" alt="me" />

DESCRIPTION

       This module uses HTML::Parser to strip HTML from text in a restrictive
       manner.  By default all HTML is restricted.  You may alter the default
       behaviour by supplying your own tag rules.

CONSTRUCTOR AND STARTUP

   new()
       Creates and returns a new HTML::Restrict object.

           my $hr = HTML::Restrict->new();

       HTML::Restrict doesn't require any params to be passed to new().  If
       your goal is to remove all HTML from text, then no further setup is
       required.  Just pass your text to the process() method and you're done:

           my $plain_text = $hr->process( $html );

       If you need to set up specific rules, have a look at the params which
       HTML::Restrict recognizes:

       •   "rules => \%rules"

           Sets the rules which will be used to process your data.  By default
           all HTML tags are off limits.  Use this argument to define the HTML
           elements and corresponding attributes you'd like to use.
           Essentially, consider the default behaviour to be:

               rules => {}

           Rules should be passed as a HASHREF of allowed tags.  Each hash
           value should be an ARRAYREF of the attributes allowed for that tag.
           For example, if you want to allow a fair amount of HTML, you can
           try something like this:

               my %rules = (
                   a       => [qw( href target )],
                   b       => [],
                   caption => [],
                   center  => [],
                   em      => [],
                   i       => [],
                   img     => [qw( alt border height width src style )],
                   li      => [],
                   ol      => [],
                   p       => [qw(style)],
                   span    => [qw(style)],
                   strong  => [],
                   sub     => [],
                   sup     => [],
                   table   => [qw( style border cellspacing cellpadding align )],
                   tbody   => [],
                   td      => [],
                   tr      => [],
                   u       => [],
                   ul      => [],
               );

               my $hr = HTML::Restrict->new( rules => \%rules );

           Or, to allow only bolded text:

               my $hr = HTML::Restrict->new( rules => { b => [] } );

           Allow bolded text, images and some (but not all) image attributes:

               my %rules = (
                   b   => [],
                   img => [qw( src alt width height border / )],
               );
               my $hr = HTML::Restrict->new( rules => \%rules );

           Since HTML::Parser treats a closing slash as an attribute, you'll
           need to add "/" to your list of allowed attributes if you'd like
           your tags to retain closing slashes.  For example:

               my $hr = HTML::Restrict->new( rules => { hr => [] } );
               $hr->process( "<hr />" ); # returns: <hr>

               $hr = HTML::Restrict->new( rules => { hr => [qw( / )] } );
               $hr->process( "<hr />" ); # returns: <hr />

           HTML::Restrict strips away any tags and attributes which are not
           explicitly allowed.  It also rebuilds your explicitly allowed tags
           and places their attributes in the order in which they appear in
           your rules.

           So, if you define the following rules:

               my %rules = (
                   ...
                   img => [qw( src alt title width height id / )]
                   ...
               );

           then your image tags will all be built like this:

               <img src="..." alt="..." title="..." width="..." height="..." id="..." />

           This gives you greater consistency in your tag layout.  If you
           don't care about attribute order you don't need to pay any
           attention to this, but you should be aware that your elements are
           being reconstructed rather than just stripped down.

           As of 2.1.0, you can also specify a regex to be tested against the
           attribute value. This feature should be considered experimental for
           the time being:

               my $hr = HTML::Restrict->new(
                   rules => {
                       iframe => [
                           qw( width height allowfullscreen ),
                           {   src         => qr{^http://www\.youtube\.com},
                               frameborder => qr{^(0|1)$},
                           }
                       ],
                       img => [ qw( alt ), { src => qr{^/my/images/} }, ],
                   },
               );

               my $html = '<img src="http://www.example.com/image.jpg" alt="Alt Text">';
               my $processed = $hr->process( $html );

               # $processed now equals: <img alt="Alt Text">

           As of 2.3.0, the value to be tested against can also be a code
           reference.  The code reference will be passed the value of the
           attribute, and should return either a string to use for the
           attribute value, or undef to remove the attribute.

               my $hr = HTML::Restrict->new(
                   rules => {
                       span => [
                           { style     => sub {
                               my $value = shift;
                               # all colors are orange
                               $value =~ s/\bcolor\s*:\s*[^;]+/color: orange/g;
                               return $value;
                           } }
                       ],
                   },
               );

               my $html = '<span style="color: #0000ff;">This is blue</span>';
               my $processed = $hr->process( $html );

               # $processed now equals:
               # <span style="color: orange;">This is blue</span>

       •   "trim => [0|1]"

           By default all leading and trailing spaces will be removed when
           text is processed.  Set this value to 0 in order to disable this
           behaviour.

       •   "uri_schemes => [undef, 'http', 'https', 'irc', ... ]"

           As of version 1.0.3, URI scheme checking is performed on all href
           and src tag attributes.  The following schemes are allowed out of
           the box.  No action is required on your part:

               [ undef, 'http', 'https' ]

           (undef represents relative URIs.)  These restrictions have been put
           in place to prevent XSS in the form of:

               <a href="javascript:alert(document.cookie)">click for cookie!</a>

           See URI for more detailed info on scheme parsing.  If, for example,
           you wanted to allow only https URIs, you would do it like this:

               uri_schemes => ['https']

           Note that this list omits undef, so relative URIs are filtered out
           as well.

           This feature is new in 1.0.3.  Previous to this, there was no
           scheme checking at all.  Moving forward, you'll need to explicitly
           whitelist all URI schemes which are not supported by default.
           This is in keeping with the whitelisting behaviour of this module
           and is also the safest possible approach.  Keep in mind that
           changes to uri_schemes are not additive, so you'll need to include
           the defaults in any changes you make, should you wish to keep them:

               # defaults + irc + mailto
               # (undef is the bare value, not the string 'undef')
               uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ]

       •   allow_declaration => [0|1]

           Set this value to true if you'd like to allow/preserve DOCTYPE
           declarations in your content.  Useful when cleaning up your own
           static files or templates. This feature is off by default.

               my $html = q[<!doctype html><body>foo</body>];

               my $hr = HTML::Restrict->new( allow_declaration => 1 );
               $html = $hr->process( $html );
               # $html is now: "<!doctype html>foo"

       •   allow_comments => [0|1]

           Set this value to true if you'd like to allow/preserve HTML
           comments in your content.  Useful when cleaning up your own static
           files or templates. This feature is off by default.

               my $html = q[<body><!-- comments! -->foo</body>];

               my $hr = HTML::Restrict->new( allow_comments => 1 );
               $html = $hr->process( $html );
               # $html is now: "<!-- comments! -->foo"

       •   replace_img => [0|1|CodeRef]

           Set the value to true if you'd like to have img tags replaced with
           "[IMAGE: ...]" containing the alt attribute text.  If you set it to
           a code reference, you can provide your own replacement (which may
           even contain HTML).

               sub replacer {
                   my ($tagname, $attr, $text) = @_; # from HTML::Parser
                   return qq{<a href="$attr->{src}">IMAGE: $attr->{alt}</a>};
               }

               my $hr = HTML::Restrict->new( replace_img => \&replacer );

           This attribute will only take effect if the img tag is not included
           in the allowed HTML.

       •   strip_enclosed_content => [ 'script', 'style', ... ]

           The default behaviour up to 1.0.4 was to preserve the content
           between script and style tags, even when the tags themselves were
           being deleted.  So, you'd be left with a bunch of JavaScript or
           CSS, just with the enclosing tags missing.  This is almost never
           what you want, so as of 1.0.5 the default is to remove any script
           or style content which is enclosed in these tags, unless the tags
           have specifically been whitelisted in the rules.  This is a sane
           default when cleaning up content submitted via a web form.
           However, if you're using HTML::Restrict to purge your own HTML you
           can be more restrictive.

               # strip the head section, in addition to JS and CSS
               my $html = '<html><head>...</head><body>...<script>JS here</script>foo';

               my $hr = HTML::Restrict->new(
                   strip_enclosed_content => [ 'script', 'style', 'head' ]
               );

               $html = $hr->process( $html );
               # $html is now '...foo' (the html and body tags themselves
               # are removed by the default rules)

           The caveat here is that HTML::Restrict will not try to fix broken
           HTML.  In the above example, if you have any opening script, style
           or head tags which don't also include matching closing tags, all
           following content will be stripped away, regardless of any parent
           tags.

           Keep in mind that changes to strip_enclosed_content are not
           additive, so if you are adding additional tags you'll need to
           include the entire list of tags whose enclosed content you'd like
           to remove.  By default, this feature strips the content of script
           and style tags.

       •   "filter_text => [0|1|CodeRef]"

           By default all text will be filtered to fix any encoding problems
           which may cause security issues.  You may override the encoding
           behaviour by providing your own anonymous sub to "filter_text".
           The first and only argument to the sub is the text which needs to
           be filtered.  The sub should return a scalar containing the
           transformed text.

               filter_text => sub {
                   my $text = shift;
                   ... # transform text
                   return $text;
               },

           You may also set this value to 0 in order to disable this
           behaviour entirely.  Please be advised this is a security risk.
           Use caution when disabling this parameter or providing your own
           filter function.
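
       The trim, uri_schemes and replace_img options above are described
       without a complete call, so here is a minimal sketch that exercises
       all three.  The input strings and variable names are made up for
       illustration, and the exact markup returned may vary between
       releases, so the script prints the results rather than asserting them.

```perl
use strict;
use warnings;
use HTML::Restrict;

# trim => 0 keeps the leading and trailing whitespace of the input.
my $untrimmed = HTML::Restrict->new( trim => 0 )
    ->process('  <b>i am bold</b>  ');

# Allow links and extend the default schemes with mailto.  undef is the
# bare value (not the string 'undef') and stands for relative URIs.
my $links = HTML::Restrict->new(
    rules       => { a => [qw( href )] },
    uri_schemes => [ undef, 'http', 'https', 'mailto' ],
);
my $safe = $links->process(
    '<a href="javascript:alert(1)">bad</a> <a href="mailto:me@example.com">mail</a>'
);

# replace_img => 1 swaps stripped img tags for a text placeholder
# built from the alt attribute.
my $placeholder = HTML::Restrict->new( replace_img => 1 )
    ->process('<img src="cat.jpg" alt="A cat">');

print "[$untrimmed]\n";
print "$safe\n";
print "$placeholder\n";
```

       The javascript: link should come back without its href, while the
       mailto: link keeps it because that scheme was added to the list.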

SUBROUTINES/METHODS

   process( $html )
       This is the method which does the real work.  It parses your data,
       removes any tags and attributes which are not specifically allowed and
       returns the resulting text.  Requires and returns a SCALAR.

   get_rules
       Accessor which returns a hash ref of the current rule set.

   get_uri_schemes
       Accessor which returns an array ref of the currently allowed URI
       schemes.
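
       The two accessors can be used to inspect how an object is configured.
       The following is a small sketch; the rule set and variable names are
       chosen purely for illustration.

```perl
use strict;
use warnings;
use HTML::Restrict;

my $hr = HTML::Restrict->new(
    rules => { b => [], a => [qw( href )] },
);

# get_rules returns a hash ref keyed by tag name.
my $rules        = $hr->get_rules;
my @allowed_tags = sort keys %{$rules};

# get_uri_schemes returns an array ref; undef stands for relative URIs,
# so map it to a printable label.
my @schemes = map { defined $_ ? $_ : '(relative)' } @{ $hr->get_uri_schemes };

print "tags: @allowed_tags\n";
print "schemes: @schemes\n";
```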

CAVEATS

       Please note that all tag and attribute names passed via the rules param
       must be supplied in lower case.

           # correct
           my $hr = HTML::Restrict->new( rules => { body => ['onload'] } );

           # throws a fatal error
           my $hr = HTML::Restrict->new( rules => { Body => ['onLoad'] } );
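
       If your rules come from a source you don't control (a config file,
       say), one way to avoid the fatal error is to lower-case them before
       calling new().  The lowercase_rules() helper below is hypothetical and
       not part of HTML::Restrict; it is only a sketch of the idea.

```perl
use strict;
use warnings;
use HTML::Restrict;

# Hypothetical helper (not part of HTML::Restrict): returns a copy of the
# rules with all tag and attribute names lower-cased.
sub lowercase_rules {
    my ($rules) = @_;
    my %normalized;
    for my $tag ( keys %{$rules} ) {
        my @attrs;
        for my $entry ( @{ $rules->{$tag} } ) {
            if ( ref $entry eq 'HASH' ) {
                # regex / code ref rules: lower-case the attribute names,
                # leave the values untouched
                my %copy = map { ( lc($_) => $entry->{$_} ) } keys %{$entry};
                push @attrs, \%copy;
            }
            else {
                push @attrs, lc $entry;
            }
        }
        $normalized{ lc $tag } = \@attrs;
    }
    return \%normalized;
}

my $rules = lowercase_rules( { Body => ['onLoad'], B => [] } );
my $hr    = HTML::Restrict->new( rules => $rules );

print join( ', ', sort keys %{$rules} ), "\n";
```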

MOTIVATION

       There are already several modules on the CPAN which accomplish much of
       the same thing, but after doing a lot of poking around, I was unable to
       find a solution with a simple setup which I was happy with.

       The most common use case might be stripping HTML from user submitted
       data completely or allowing just a few tags and attributes to be
       displayed.  With the exception of URI scheme checking, this module
       doesn't do any validation on the actual content of the tags or
       attributes.  If this is a requirement, you can either mess with the
       parser object, post-process the text yourself or have a look at one of
       the more feature-rich modules in the SEE ALSO section below.

       My aim here is to keep things easy and, hopefully, cover a lot of the
       less complex use cases with just a few lines of code and some brief
       documentation.  The idea is to be up and running quickly.

ACKNOWLEDGEMENTS

       Thanks to Raybec Communications <http://www.raybec.com> for funding my
       work on this module and for releasing it to the world.

       Thanks also to the many other contributors.
       <https://github.com/oalders/html-restrict/graphs/contributors>

AUTHOR

       Olaf Alders <olaf@wundercounter.com>

COPYRIGHT AND LICENSE

       This software is copyright (c) 2009 by Olaf Alders.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

perl v5.36.0                      2022-09-23                 HTML::Restrict(3)