HTML::Restrict(3pm)

HTML::Restrict(3)     User Contributed Perl Documentation    HTML::Restrict(3)

NAME

       HTML::Restrict - Strip unwanted HTML tags and attributes

VERSION

       version v2.3.0

SYNOPSIS

           use HTML::Restrict;

           my $hr = HTML::Restrict->new();

           # use default rules to start with (strip away all HTML)
           my $processed = $hr->process('  <b>i am bold</b>  ');

           # $processed now equals: 'i am bold'

           # Now, a less restrictive example:
           use HTML::Restrict;

           my $hr = HTML::Restrict->new(
               rules => {
                   b   => [],
                   img => [qw( src alt / )]
               }
           );

           my $html = q[<body><b>hello</b> <img src="pic.jpg" alt="me" id="test" /></body>];
           my $processed = $hr->process( $html );

           # $processed now equals: <b>hello</b> <img src="pic.jpg" alt="me" />

DESCRIPTION

       This module uses HTML::Parser to strip HTML from text in a restrictive
       manner.  By default all HTML is restricted.  You may alter the default
       behaviour by supplying your own tag rules.

CONSTRUCTOR AND STARTUP

   new()
       Creates and returns a new HTML::Restrict object.

           my $hr = HTML::Restrict->new();

       HTML::Restrict doesn't require any params to be passed to new.  If your
       goal is to remove all HTML from text, then no further setup is
       required.  Just pass your text to the process() method and you're done:

           my $plain_text = $hr->process( $html );

       If you need to set up specific rules, have a look at the params which
       HTML::Restrict recognizes:

       ·   "rules => \%rules"

           Sets the rules which will be used to process your data.  By default
           all HTML tags are off limits.  Use this argument to define the HTML
           elements and corresponding attributes you'd like to use.
           Essentially, consider the default behaviour to be:

               rules => {}

           Rules should be passed as a HASHREF of allowed tags.  Each hash
           value should represent the allowed attributes for the listed tag.
           For example, if you want to allow a fair amount of HTML, you can
           try something like this:

               my %rules = (
                   a       => [qw( href target )],
                   b       => [],
                   caption => [],
                   center  => [],
                   em      => [],
                   i       => [],
                   img     => [qw( alt border height width src style )],
                   li      => [],
                   ol      => [],
                   p       => [qw(style)],
                   span    => [qw(style)],
                   strong  => [],
                   sub     => [],
                   sup     => [],
                   table   => [qw( style border cellspacing cellpadding align )],
                   tbody   => [],
                   td      => [],
                   tr      => [],
                   u       => [],
                   ul      => [],
               );

               my $hr = HTML::Restrict->new( rules => \%rules );

           Or, to allow only bolded text:

               my $hr = HTML::Restrict->new( rules => { b => [] } );

           Allow bolded text, images and some (but not all) image attributes:

               my %rules = (
                   b   => [],
                   img => [qw( src alt width height border / )],
               );
               my $hr = HTML::Restrict->new( rules => \%rules );

           Since HTML::Parser treats a closing slash as an attribute, you'll
           need to add "/" to your list of allowed attributes if you'd like
           your tags to retain closing slashes.  For example:

               my $hr = HTML::Restrict->new( rules => { hr => [] } );
               $hr->process( "<hr />" ); # returns: <hr>

               my $hr = HTML::Restrict->new( rules => { hr => [qw( / )] } );
               $hr->process( "<hr />" ); # returns: <hr />

           HTML::Restrict strips away any tags and attributes which are not
           explicitly allowed.  It also rebuilds your explicitly allowed tags
           and places their attributes in the order in which they appear in
           your rules.

           So, if you define the following rules:

               my %rules = (
                   ...
                   img => [qw( src alt title width height id / )]
                   ...
               );

           then your image tags will all be built like this:

               <img src="..." alt="..." title="..." width="..." height="..." id="..." />

           This gives you greater consistency in your tag layout.  If you
           don't care about attribute order you don't need to pay any
           attention to this, but you should be aware that your elements are
           being reconstructed rather than just stripped down.

           As of 2.1.0, you can also specify a regex to be tested against the
           attribute value.  This feature should be considered experimental
           for the time being:

               my $hr = HTML::Restrict->new(
                   rules => {
                       iframe => [
                           qw( width height allowfullscreen ),
                           {   src         => qr{^http://www\.youtube\.com},
                               frameborder => qr{^(0|1)$},
                           }
                       ],
                       img => [ qw( alt ), { src => qr{^/my/images/} }, ],
                   },
               );

               my $html = '<img src="http://www.example.com/image.jpg" alt="Alt Text">';
               my $processed = $hr->process( $html );

               # $processed now equals: <img alt="Alt Text">

           As of 2.3.0, the value to be tested against can also be a code
           reference.  The code reference will be passed the value of the
           attribute, and should return either a string to use for the
           attribute value, or undef to remove the attribute.

               my $hr = HTML::Restrict->new(
                   rules => {
                       span => [
                           { style => sub {
                               my $value = shift;
                               # all colors are orange
                               $value =~ s/\bcolor\s*:\s*[^;]+/color: orange/g;
                               return $value;
                           } }
                       ],
                   },
               );

               my $html = '<span style="color: #0000ff;">This is blue</span>';
               my $processed = $hr->process( $html );

               # $processed now equals:
               # <span style="color: orange;">This is blue</span>

       ·   "trim => [0|1]"

           By default all leading and trailing spaces will be removed when
           text is processed.  Set this value to 0 in order to disable this
           behaviour.

       ·   "uri_schemes => [undef, 'http', 'https', 'irc', ... ]"

           As of version 1.0.3, URI scheme checking is performed on all href
           and src tag attributes.  The following schemes are allowed out of
           the box.  No action is required on your part:

               [ undef, 'http', 'https' ]

           (undef represents relative URIs).  These restrictions have been put
           in place to prevent XSS in the form of:

               <a href="javascript:alert(document.cookie)">click for cookie!</a>

           See URI for more detailed info on scheme parsing.  If, for example,
           you wanted to filter out every scheme except https, you would do it
           like this:

               uri_schemes => ['https']

           This feature is new in 1.0.3.  Prior to this, there was no scheme
           checking at all.  Moving forward, you'll need to explicitly
           whitelist all URI schemes which are not supported by default.
           This is in keeping with the whitelisting behaviour of this module
           and is also the safest possible approach.  Keep in mind that
           changes to uri_schemes are not additive, so you'll need to include
           the defaults in any changes you make, should you wish to keep them:

               # defaults + irc + mailto
               uri_schemes => [ undef, 'http', 'https', 'irc', 'mailto' ]

       ·   "allow_declaration => [0|1]"

           Set this value to true if you'd like to allow/preserve DOCTYPE
           declarations in your content.  Useful when cleaning up your own
           static files or templates.  This feature is off by default.

               my $html = q[<!doctype html><body>foo</body>];

               my $hr = HTML::Restrict->new( allow_declaration => 1 );
               $html = $hr->process( $html );
               # $html is now: "<!doctype html>foo"

       ·   "allow_comments => [0|1]"

           Set this value to true if you'd like to allow/preserve HTML
           comments in your content.  Useful when cleaning up your own static
           files or templates.  This feature is off by default.

               my $html = q[<body><!-- comments! -->foo</body>];

               my $hr = HTML::Restrict->new( allow_comments => 1 );
               $html = $hr->process( $html );
               # $html is now: "<!-- comments! -->foo"

       ·   "replace_img => [0|1|CodeRef]"

           Set this value to true if you'd like to have img tags replaced with
           "[IMAGE: ...]" containing the alt attribute text.  If you set it to
           a code reference, you can provide your own replacement (which may
           even contain HTML).

               sub replacer {
                   my ($tagname, $attr, $text) = @_; # from HTML::Parser
                   return qq{<a href="$attr->{src}">IMAGE: $attr->{alt}</a>};
               }

               my $hr = HTML::Restrict->new( replace_img => \&replacer );

           This option will only take effect if the img tag is not included
           in the allowed HTML.

       ·   "strip_enclosed_content => [ 'script', 'style', ... ]"

           The default behaviour up to 1.0.4 was to preserve the content
           between script and style tags, even when the tags themselves were
           being deleted.  So, you'd be left with a bunch of JavaScript or
           CSS, just with the enclosing tags missing.  This is almost never
           what you want, so as of 1.0.5 the default is to remove any content
           which is enclosed in script or style tags, unless those tags have
           specifically been whitelisted in the rules.  This is a sane default
           when cleaning up content submitted via a web form.  However, if
           you're using HTML::Restrict to purge your own HTML you can be more
           restrictive by passing an ARRAYREF of tag names:

               # strip the head section, in addition to JS and CSS
               my $html = '<html><head>...</head><body>...<script>JS here</script>foo';

               my $hr = HTML::Restrict->new(
                   strip_enclosed_content => [ 'script', 'style', 'head' ]
               );

               $html = $hr->process( $html );
               # $html is now '...foo' (with the default rules, the html and
               # body tags themselves are stripped as well)

           The caveat here is that HTML::Restrict will not try to fix broken
           HTML.  In the above example, if you have any opening script, style
           or head tags which don't also include matching closing tags, all
           following content will be stripped away, regardless of any parent
           tags.

           Keep in mind that changes to strip_enclosed_content are not
           additive, so if you are adding additional tags you'll need to
           include the entire list of tags whose enclosed content you'd like
           to remove.  By default, the content of script and style tags is
           stripped.
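
       The trim and uri_schemes params above are described without a
       complete example, so here is a minimal sketch of both.  The sample
       strings and variable names are illustrative assumptions, and the
       output is printed rather than stated, since the exact markup may
       differ between versions:

```perl
use strict;
use warnings;
use HTML::Restrict;

# trim => 0 keeps leading and trailing whitespace (the default is to trim).
my $untrimmed = HTML::Restrict->new( trim => 0 );
my $padded    = $untrimmed->process('  <b>i am bold</b>  ');
print "[$padded]\n";

# uri_schemes is not additive: list undef (relative URIs), http and https
# again if you want to keep the defaults while adding mailto.
my $links = HTML::Restrict->new(
    rules       => { a => [qw( href )] },
    uri_schemes => [ undef, 'http', 'https', 'mailto' ],
);

# Hypothetical sample input: one allowed scheme and one disallowed scheme.
my $html = join q{ },
    q[<a href="mailto:me@example.com">mail me</a>],
    q[<a href="javascript:alert(1)">click</a>];

my $cleaned = $links->process($html);
print "$cleaned\n";
```

       Note that undef is passed as the bare value, not as the string
       'undef', which would be treated as a scheme name.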

SUBROUTINES/METHODS

   process( $html )
       This is the method which does the real work.  It parses your data,
       removes any tags and attributes which are not specifically allowed and
       returns the resulting text.  Requires and returns a SCALAR.

   get_rules
       Accessor which returns a hash ref of the current rule set.

   get_uri_schemes
       Accessor which returns an array ref of the current valid uri schemes.
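
       As a rough illustration of the three methods together, the sketch
       below builds an object, inspects its configuration through the
       accessors and then processes a string.  The rules and the input
       string are made up for the example:

```perl
use strict;
use warnings;
use HTML::Restrict;

my $hr = HTML::Restrict->new(
    rules => {
        b   => [],
        img => [qw( src alt / )],
    },
);

# get_rules returns a hash ref: tag name => array ref of allowed attributes.
my $rules = $hr->get_rules;
for my $tag ( sort keys %{$rules} ) {
    # Attribute entries may also be hash refs (regex or code ref checks),
    # so only plain strings are printed by name here.
    my @attrs = map { ref $_ ? '(constraint)' : $_ } @{ $rules->{$tag} };
    print "$tag: @attrs\n";
}

# get_uri_schemes returns an array ref; undef stands for relative URIs.
my @schemes = map { defined $_ ? $_ : 'undef' } @{ $hr->get_uri_schemes };
print "schemes: @schemes\n";

# process() takes a scalar and returns the cleaned scalar.
my $clean = $hr->process('<p><b>hello</b> <i>world</i></p>');
print "$clean\n";
```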

CAVEATS

       Please note that all tag and attribute names passed via the rules param
       must be supplied in lower case.

           # correct
           my $hr = HTML::Restrict->new( rules => { body => ['onload'] } );

           # throws a fatal error
           my $hr = HTML::Restrict->new( rules => { Body => ['onLoad'] } );
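
       If your rules come from a config file or another source where the
       case is not under your control, one possible approach is to
       lower-case them before calling new.  The lc_rules helper below is
       hypothetical and not part of HTML::Restrict; it is only a sketch of
       the idea:

```perl
use strict;
use warnings;
use HTML::Restrict;

# Hypothetical helper (not part of HTML::Restrict): return a copy of the
# rules with every tag and attribute name lower-cased.
sub lc_rules {
    my ($rules) = @_;
    my %clean;
    for my $tag ( keys %{$rules} ) {
        my @attrs;
        for my $item ( @{ $rules->{$tag} } ) {
            if ( ref($item) eq 'HASH' ) {
                # attribute => regex/code ref pairs: lower-case the keys
                my %checks = map { ( lc($_) => $item->{$_} ) } keys %{$item};
                push @attrs, \%checks;
            }
            else {
                push @attrs, lc $item;
            }
        }
        $clean{ lc $tag } = \@attrs;
    }
    return \%clean;
}

# Made-up rules with mixed-case names, as they might arrive from a config.
my %mixed_case = ( Body => ['onLoad'], IMG => [ 'ALT', { Src => qr{^/} } ] );

my $hr = HTML::Restrict->new( rules => lc_rules( \%mixed_case ) );
print join( ', ', sort keys %{ $hr->get_rules } ), "\n";
```

       Building a new structure rather than editing the original in place
       leaves the caller's hash untouched.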

MOTIVATION

       There are already several modules on the CPAN which accomplish much of
       the same thing, but after doing a lot of poking around, I was unable to
       find a solution with a simple setup which I was happy with.

       The most common use case might be stripping HTML from user submitted
       data completely or allowing just a few tags and attributes to be
       displayed.  With the exception of URI scheme checking, this module
       doesn't do any validation on the actual content of the tags or
       attributes.  If this is a requirement, you can either mess with the
       parser object, post-process the text yourself or have a look at one of
       the more feature-rich modules in the SEE ALSO section below.

       My aim here is to keep things easy and, hopefully, cover a lot of the
       less complex use cases with just a few lines of code and some brief
       documentation.  The idea is to be up and running quickly.

ACKNOWLEDGEMENTS

       Thanks to Raybec Communications <http://www.raybec.com> for funding my
       work on this module and for releasing it to the world.

       Thanks also to the following for patches, bug reports and assistance:

       Mark Jubenville (ioncache)

       Duncan Forsyth

       Rick Moore

       Arthur Axel 'fREW' Schmidt

       perlpong

       David Golden

       Graham TerMarsch

       Dagfinn Ilmari Mannsåker

       Graham Knop

       Carwyn Ellis

AUTHOR

       Olaf Alders <olaf@wundercounter.com>

COPYRIGHT AND LICENSE

       This software is copyright (c) 2013-2017 by Olaf Alders.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

perl v5.28.0                      2018-02-09                 HTML::Restrict(3)