HTML::Scrubber(3pm)

HTML::Scrubber(3)     User Contributed Perl Documentation    HTML::Scrubber(3)

NAME

       HTML::Scrubber - Perl extension for scrubbing/sanitizing HTML

VERSION

       version 0.19

SYNOPSIS

           use HTML::Scrubber;

           my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );
           print $scrubber->scrub('<p><b>bold</b> <em>missing</em></p>');
           # output is: <p><b>bold</b> </p>

           # more complex input
           my $html = q[
           <style type="text/css"> BAD { background: #666; color: #666;} </style>
           <script language="javascript"> alert("Hello, I am EVIL!");    </script>
           <HR>
               a   => <a href=1>link </a>
               br  => <br>
               b   => <B> bold </B>
               u   => <U> UNDERLINE </U>
           ];

           print $scrubber->scrub($html);

           $scrubber->deny( qw[ p b i u hr br ] );

           print $scrubber->scrub($html);

DESCRIPTION

       If you want to "scrub" or "sanitize" HTML input in a reliable and
       flexible fashion, then this module is for you.

       I wasn't satisfied with HTML::Sanitizer because it is based on
       HTML::TreeBuilder, so I thought I'd write something similar that works
       directly with HTML::Parser.

METHODS

       First a note on documentation: just study the EXAMPLE below. It's all
       the documentation you could need.

       Also, be sure to read all the comments as well as "How does it work?".

       If you're new to Perl, good luck to you.

   new
           my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );

       Build a new HTML::Scrubber.  The arguments are the initial values for
       the following directives:

       ·   default

       ·   allow

       ·   deny

       ·   rules

       ·   process

       ·   comment

   comment
           warn "comments are  ", $p->comment ? 'allowed' : 'not allowed';
           $p->comment(0);  # off by default

   process
           warn "process instructions are  ", $p->process ? 'allowed' : 'not allowed';
           $p->process(0);  # off by default

   script
           warn "script tags (and everything in between) are suppressed"
               unless $p->script;  # off by default
           $p->script(1);          # accepts 0 or 1

       ** Please note that suppression is implemented using HTML::Parser's
       "ignore_elements" function. If "script" is set to true, script
       elements are no longer ignored, and all script tags encountered will
       be validated like all other tags.

   style
           warn "style tags (and everything in between) are suppressed"
               unless $p->style;   # off by default
           $p->style(1);           # accepts 0 or 1

       ** Please note that suppression is implemented using HTML::Parser's
       "ignore_elements" function. If "style" is set to true, style elements
       are no longer ignored, and all style tags encountered will be
       validated like all other tags.

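       The following sketch shows the default behaviour only, assuming a
       scrubber that allows nothing but "b". The markup and variable names
       are illustrative.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new( allow => ['b'] );
my $html     = '<script>alert("x")</script><b>ok</b>';

# With "script" left at its default (off), the script element and
# everything inside it are skipped by the parser.
my $default = $scrubber->scrub($html);
print "default: $default\n";

# Setting script(1) stops the element from being skipped; its tags are
# then checked against the ordinary tag rules instead.
$scrubber->script(1);
print "script(1): ", $scrubber->scrub($html), "\n";
```

       Leaving both switches off is usually the safer choice, because the
       body of a script or style element is otherwise treated as ordinary
       text.
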
   allow
           $p->allow(qw[ t a g s ]);

   deny
           $p->deny(qw[ t a g s ]);

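       A small sketch of combining the two calls. It assumes that a later
       call for the same tag replaces the earlier setting; the tags chosen
       are arbitrary.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new;
$scrubber->allow(qw( b i u ));
$scrubber->deny('u');    # takes "u" back out of the allowed set

my $out = $scrubber->scrub('<b>bold</b> <i>it</i> <u>under</u>');
print $out, "\n";
```
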
   rules
           $p->rules(
               img => {
                   src => qr{^(?!http://)}i, # only relative image links allowed
                   alt => 1,                 # alt attribute allowed
                   '*' => 0,                 # deny all other attributes
               },
               a => {
                   href => sub { ... },      # check or adjust with a callback
               },
               b => 1,
               ...
           );

       Updates a set of attribute rules. Each rule can be 1/0, a regular
       expression, or a callback. String values longer than 1 character are
       treated as regexps. The callback is called with the following
       arguments: the current object, tag name, attribute name, and attribute
       value. The callback should return an empty list to drop the attribute,
       "undef" to keep it without a value, or a new scalar value.

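       As an illustration of the callback form described above, this sketch
       keeps "href" only when it starts with "http://" or "https://". The
       URL policy, the sample markup and the variable names are assumptions
       for the example, not something the module prescribes.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new;
$scrubber->rules(
    a => {
        href => sub {
            my ( $self, $tag, $attr, $value ) = @_;
            return $value if $value =~ m{^https?://}i;    # keep unchanged
            return;    # empty list: drop the attribute
        },
        title => 1,    # always allowed
        '*'   => 0,    # everything else is removed
    },
);

my $good = $scrubber->scrub('<a href="http://example.com/" onclick="x()">ok</a>');
my $bad  = $scrubber->scrub('<a href="javascript:alert(1)" title="t">no</a>');

print "$good\n$bad\n";
```

       A bare "return;" matters here: returning "undef" instead would keep
       the attribute name without a value.
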
   default
           print "default is ", $p->default();
           $p->default(1);      # allow tags by default
           $p->default(
               undef,           # don't change
               {                # default attribute rules
                   '*' => 1,    # allow attributes by default
               }
           );

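       A sketch of a permissive default with a few attributes switched off.
       The attribute names and markup are chosen only for illustration.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new;
$scrubber->default(
    1,                                          # allow tags without an explicit rule
    { '*' => 1, onclick => 0, style => 0 },     # default attribute rules
);

my $out = $scrubber->scrub('<em class="c" onclick="x()">hi</em>');
print $out, "\n";

# With no arguments, default() reports the current tag default.
my $tag_default = $scrubber->default;
print "default is $tag_default\n";
```
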
   scrub_file
           $html = $scrubber->scrub_file('foo.html');   ## returns giant string
           die "Eeek $!" unless defined $html;  ## opening foo.html may have failed
           $scrubber->scrub_file('foo.html', 'new.html') or die "Eeek $!";
           $scrubber->scrub_file('foo.html', *STDOUT)
               or die "Eeek $!"
                   if fileno STDOUT;

   scrub
           print $scrubber->scrub($html);  ## returns giant string
           $scrubber->scrub($html, 'new.html') or die "Eeek $!";
           $scrubber->scrub($html, *STDOUT)
               or die "Eeek $!"
                   if fileno STDOUT;

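       The two calling styles of "scrub_file" can be tried with temporary
       files, as in this sketch. The file names, the sample markup and the
       use of File::Temp are assumptions of the example.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use HTML::Scrubber;

# Work in a scratch directory that is removed at exit.
my $dir      = tempdir( CLEANUP => 1 );
my $in_path  = File::Spec->catfile( $dir, 'in.html' );
my $out_path = File::Spec->catfile( $dir, 'out.html' );

open my $fh, '>', $in_path or die "cannot write $in_path: $!";
print {$fh} '<b>hi</b><blink>no</blink>';
close $fh or die "cannot close $in_path: $!";

my $scrubber = HTML::Scrubber->new( allow => ['b'] );

# One argument: the scrubbed document comes back as a string.
my $string = $scrubber->scrub_file($in_path);
die "cannot scrub $in_path: $!" unless defined $string;
print $string, "\n";

# Two arguments: the result is written to the named file instead.
$scrubber->scrub_file( $in_path, $out_path )
    or die "cannot write $out_path: $!";
```
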
       Internally, three default handlers do the work:

       "_scrub_str" is the default handler used by both "_scrub" and
       "_scrub_fh". All the common code (basically all of it) was moved into
       this single routine for ease of maintenance.

       "_scrub_fh" does the scrubbing when the output goes to a file. It
       calls "_scrub_str" and pushes the result out to the file.

       "_scrub" does the scrubbing when a giant string is returned. It calls
       "_scrub_str" and appends the result to the output string.

How does it work?

       When a tag is encountered, HTML::Scrubber allows/denies the tag using
       the explicit rule if one exists.

       If no explicit rule exists, Scrubber applies the default rule.

       If an explicit rule exists, but it's a simple rule (1), then the
       default attribute rule is applied.

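       The three cases above can be sketched in plain Perl. The "decide"
       helper below is hypothetical: it is not part of HTML::Scrubber and
       only mirrors the decision order described here.

```perl
use strict;
use warnings;

# Hypothetical helper: name which rule settles the fate of a tag.
sub decide {
    my ( $rules, $default_allow, $tag ) = @_;

    if ( exists $rules->{$tag} ) {
        my $rule = $rules->{$tag};
        return 'explicit attribute rules' if ref $rule;    # hash of attribute rules
        return 'default attribute rules'  if $rule;        # simple rule (1)
        return 'denied';                                   # simple rule (0)
    }

    # No explicit rule: fall back to the default tag rule.
    return $default_allow ? 'default attribute rules' : 'denied';
}

my %rules = ( b => 1, script => 0, img => { src => 1, '*' => 0 } );
print "$_: ", decide( \%rules, 0, $_ ), "\n" for qw( img b script em );
# img: explicit attribute rules
# b: default attribute rules
# script: denied
# em: denied
```
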
   EXAMPLE
           #!/usr/bin/perl -w
           use HTML::Scrubber;
           use strict;

           my @allow = qw[ br hr b a ];

           my @rules = (
               script => 0,
               img    => {
                   src => qr{^(?!http://)}i,    # only relative image links allowed
                   alt => 1,                    # alt attribute allowed
                   '*' => 0,                    # deny all other attributes
               },
           );

           my @default = (
               0 =>                             # default rule, deny all tags
                   {
                   '*'    => 1,                 # default rule, allow all attributes
                   'href' => qr{^(?:http|https|ftp)://}i,
                   'src'  => qr{^(?:http|https|ftp)://}i,

                   #   If your perl doesn't have qr
                   #   just use a string with length greater than 1
                   'cite'        => '(?i-xsm:^(?:http|https|ftp):)',
                   'language'    => 0,
                   'name'        => 1,          # could be sneaky, but hey ;)
                   'onblur'      => 0,
                   'onchange'    => 0,
                   'onclick'     => 0,
                   'ondblclick'  => 0,
                   'onerror'     => 0,
                   'onfocus'     => 0,
                   'onkeydown'   => 0,
                   'onkeypress'  => 0,
                   'onkeyup'     => 0,
                   'onload'      => 0,
                   'onmousedown' => 0,
                   'onmousemove' => 0,
                   'onmouseout'  => 0,
                   'onmouseover' => 0,
                   'onmouseup'   => 0,
                   'onreset'     => 0,
                   'onselect'    => 0,
                   'onsubmit'    => 0,
                   'onunload'    => 0,
                   'src'         => 0,          # repeated key: this later value replaces the 'src' regexp above
                   'type'        => 0,
                   }
           );

           my $scrubber = HTML::Scrubber->new();
           $scrubber->allow(@allow);
           $scrubber->rules(@rules);    # key/value pairs
           $scrubber->default(@default);
           $scrubber->comment(1);       # 1 allow, 0 deny

           ## preferred way to create the same object
           $scrubber = HTML::Scrubber->new(
               allow   => \@allow,
               rules   => \@rules,
               default => \@default,
               comment => 1,
               process => 0,
           );

           require Data::Dumper, die Data::Dumper::Dumper($scrubber) if @ARGV;

           my $it = q[
               <?php   echo(" EVIL EVIL EVIL "); ?>    <!-- asdf -->
               <hr>
               <I FAKE="attribute" > IN ITALICS WITH FAKE="attribute" </I><br>
               <B> IN BOLD </B><br>
               <A NAME="evil">
                   <A HREF="javascript:alert('die die die');">HREF=JAVA &lt;!&gt;</A>
                   <br>
                   <A HREF="image/bigone.jpg" ONMOUSEOVER="alert('die die die');">
                       <IMG SRC="image/smallone.jpg" ALT="ONMOUSEOVER JAVASCRIPT">
                   </A>
               </A> <br>
           ];

           print "#original text", $/, $it, $/;
           print
               "#scrubbed text (default ",
               $scrubber->default(),    # no arguments returns the current value
               " comment ",
               $scrubber->comment(),
               " process ",
               $scrubber->process(),
               " )", $/,
               $scrubber->scrub($it),
               $/;

           $scrubber->default(1);       # allow all tags by default
           $scrubber->comment(0);       # deny comments

           print
               "#scrubbed text (default ",
               $scrubber->default(),
               " comment ",
               $scrubber->comment(),
               " process ",
               $scrubber->process(),
               " )", $/,
               $scrubber->scrub($it),
               $/;

           $scrubber->process(1);    # allow process instructions (dangerous)
           $default[0] = 1;          # allow all tags by default
           $default[1]->{'*'} = 0;   # deny all attributes by default
           $scrubber->default(@default);    # set the default again

           print
               "#scrubbed text (default ",
               $scrubber->default(),
               " comment ",
               $scrubber->comment(),
               " process ",
               $scrubber->process(),
               " )", $/,
               $scrubber->scrub($it),
               $/;

   FUN
       If you have Test::Inline (and you've installed HTML::Scrubber), try

           pod2test Scrubber.pm >scrubber.t
           perl scrubber.t

VERSION REQUIREMENTS

       As of version 0.14 I have added a perl minimum version requirement of
       5.8. This is basically due to failures on the smokers' perl 5.6
       installations, which appear to be down to installation mechanisms and
       requirements.

       Since I don't want to spend the time supporting a version that is so
       old (and may not work for reasons of UTF support etc), I have added a
       "use 5.008;" to the main module.

       If this is problematic I am very willing to accept patches to fix this
       up, although I do not personally see a good reason to support a release
       that has been obsolete for 13 years.

CONTRIBUTING

       If you want to contribute to the development of this module, the code
       is on GitHub <http://github.com/nigelm/html-scrubber>. You'll need a
       perl environment with Dist::Zilla, and if you're just getting started,
       there's some documentation on using Vagrant and Perlbrew here
       <http://mrcaron.github.io/2015/03/06/Perl-CPAN-Pull-Request.html>.

       There is now a ".perltidyrc" and a ".tidyallrc" file within the
       repository for the standard perltidy settings used; I will apply these
       before new releases.  Please do not let formatting prevent you from
       sending in patches etc; this can be sorted out as part of the release
       process.  Info on "tidyall" can be found at
       <https://metacpan.org/pod/distribution/Code-TidyAll/bin/tidyall>.

AUTHORS

       ·   Ruslan Zakirov <Ruslan.Zakirov@gmail.com>

       ·   Nigel Metheringham <nigelm@cpan.org>

       ·   D. H. <podmaster@cpan.org>

COPYRIGHT AND LICENSE

       This software is copyright (c) 2018 by Ruslan Zakirov, Nigel
       Metheringham, 2003-2004 D. H.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

SUPPORT

   Perldoc
       You can find documentation for this module with the perldoc command.

         perldoc HTML::Scrubber

   Websites
       The following websites have more information about this module, and may
       be of help to you. As always, in addition to those websites please use
       your favorite search engine to discover more resources.

       ·   MetaCPAN

           A modern, open-source CPAN search engine, useful to view POD in
           HTML format.

           <https://metacpan.org/release/HTML-Scrubber>

       ·   Search CPAN

           The default CPAN search engine, useful to view POD in HTML format.

           <http://search.cpan.org/dist/HTML-Scrubber>

       ·   RT: CPAN's Bug Tracker

           The RT ( Request Tracker ) website is the default bug/issue
           tracking system for CPAN.

           <https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-Scrubber>

       ·   AnnoCPAN

           The AnnoCPAN is a website that allows community annotations of Perl
           module documentation.

           <http://annocpan.org/dist/HTML-Scrubber>

       ·   CPAN Ratings

           The CPAN Ratings is a website that allows community ratings and
           reviews of Perl modules.

           <http://cpanratings.perl.org/d/HTML-Scrubber>

       ·   CPANTS

           The CPANTS is a website that analyzes the Kwalitee ( code metrics )
           of a distribution.

           <http://cpants.cpanauthors.org/dist/HTML-Scrubber>

       ·   CPAN Testers

           The CPAN Testers is a network of smoke testers who run automated
           tests on uploaded CPAN distributions.

           <http://www.cpantesters.org/distro/H/HTML-Scrubber>

       ·   CPAN Testers Matrix

           The CPAN Testers Matrix is a website that provides a visual
           overview of the test results for a distribution on various
           Perls/platforms.

           <http://matrix.cpantesters.org/?dist=HTML-Scrubber>

       ·   CPAN Testers Dependencies

           The CPAN Testers Dependencies is a website that shows a chart of
           the test results of all dependencies for a distribution.

           <http://deps.cpantesters.org/?module=HTML::Scrubber>

   Bugs / Feature Requests
       Please report any bugs or feature requests by email to
       "bug-html-scrubber at rt.cpan.org", or through the web interface at
       <https://rt.cpan.org/Public/Bug/Report.html?Queue=HTML-Scrubber>. You
       will be automatically notified of any progress on the request by the
       system.

   Source Code
       The code is open to the world, and available for you to hack on. Please
       feel free to browse it and play with it, or whatever. If you want to
       contribute patches, please send me a diff or prod me to pull from your
       repository :)

       <https://github.com/nigelm/html-scrubber>

         git clone https://github.com/nigelm/html-scrubber.git

CONTRIBUTORS

       ·   Andrei Vereha <avereha@gmail.com>

       ·   Lee Johnson <lee@givengain.ch>

       ·   Michael Caron <michael.r.caron@gmail.com>

       ·   Michael Caron <mrcaron@users.noreply.github.com>

       ·   Nigel Metheringham <nm9762github@muesli.org.uk>

       ·   Paul Cochrane <paul@liekut.de>

       ·   Ruslan Zakirov <ruz@bestpractical.com>

       ·   Sergey Romanov <complefor@rambler.ru>

       ·   vagrant <vagrant@precise64.(none)>

perl v5.32.0                      2020-07-28                 HTML::Scrubber(3)