HTML::Scrubber(3)     User Contributed Perl Documentation     HTML::Scrubber(3)

NAME
HTML::Scrubber - Perl extension for scrubbing/sanitizing HTML

VERSION
version 0.19

SYNOPSIS
    use HTML::Scrubber;

    my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );
    print $scrubber->scrub('<p><b>bold</b> <em>missing</em></p>');
    # output is: <p><b>bold</b> missing</p>
    # (the <em> tags are removed; the text between them is kept)

    # more complex input
    my $html = q[
    <style type="text/css"> BAD { background: #666; color: #666;} </style>
    <script language="javascript"> alert("Hello, I am EVIL!"); </script>
    <HR>
        a   => <a href=1>link </a>
        br  => <br>
        b   => <B> bold </B>
        u   => <U> UNDERLINE </U>
    ];

    print $scrubber->scrub($html);

    $scrubber->deny( qw[ p b i u hr br ] );

    print $scrubber->scrub($html);

DESCRIPTION
If you want to "scrub" or "sanitize" HTML input in a reliable and
flexible fashion, then this module is for you.

I wasn't satisfied with HTML::Sanitizer because it is based on
HTML::TreeBuilder, so I thought I'd write something similar that works
directly with HTML::Parser.

METHODS
First a note on documentation: just study the EXAMPLE below. It's all
the documentation you could need.

Also, be sure to read all the comments as well as "How does it work?".

If you're new to Perl, good luck to you.

new
    my $scrubber = HTML::Scrubber->new( allow => [ qw[ p b i u hr br ] ] );

Build a new HTML::Scrubber object. The arguments are the initial values
for the following directives:

·   default

·   allow

·   deny

·   rules

·   process

·   comment
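
A minimal sketch of passing several of these directives to "new" in one call. The tag names, the URL and the input string are illustrative assumptions; list-valued directives are given as array references, as in the EXAMPLE section.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $scrubber = HTML::Scrubber->new(
    default => 0,                      # deny tags that have no rule
    allow   => [qw( p b i )],          # simple allow rules
    rules   => [                       # key/value pairs, as an array reference
        a => { href => qr{^https?://}i, '*' => 0 },
    ],
    comment => 0,                      # drop comments
    process => 0,                      # drop processing instructions
);

# Hypothetical input: one allowed link, one disallowed attribute,
# one disallowed tag and one comment.
my $out = $scrubber->scrub(
    '<p>See <a href="http://example.com/" onclick="x()">this</a> <u>page</u><!-- note --></p>'
);
print $out, "\n";
```

Only the tag markup is removed for a denied tag such as "u"; the text inside it stays in the output.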

comment
    warn "comments are ", $p->comment ? 'allowed' : 'not allowed';
    $p->comment(0);    # off by default

process
    warn "process instructions are ", $p->process ? 'allowed' : 'not allowed';
    $p->process(0);    # off by default
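
A short sketch of the two switches in use, assuming a scrubber that allows only "b" and a made-up input string. Comments are dropped until "comment" is turned on; the processing instruction stays dropped because "process" is left off.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $p    = HTML::Scrubber->new( allow => [qw( b )] );
my $html = '<b>kept</b><!-- a comment --><?php echo 1; ?>';

# Both switches are off by default.
my $without = $p->scrub($html);

# Turn comments on; processing instructions remain off.
$p->comment(1);
my $with = $p->scrub($html);

print "comment off: $without\n";
print "comment on:  $with\n";
```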

script
    warn "script tags (and everything in between) are suppressed"
        unless $p->script;     # off by default
    $p->script( 0 || 1 );      # i.e. pass either 0 or 1

** Please note that this is implemented using HTML::Parser's
"ignore_elements" function. While "script" is false (the default),
script elements and everything in between are ignored; if "script" is
set to true, all script tags encountered will be validated like all
other tags.

style
    warn "style tags (and everything in between) are suppressed"
        unless $p->style;      # off by default
    $p->style( 0 || 1 );       # i.e. pass either 0 or 1

** Please note that this is implemented using HTML::Parser's
"ignore_elements" function. While "style" is false (the default), style
elements and everything in between are ignored; if "style" is set to
true, all style tags encountered will be validated like all other tags.
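
The following sketch contrasts the two settings of "script", assuming the behaviour described in the note above. The input string is made up. With "script" off, the element and its content should disappear; with it on, the tag is validated like any other tag (and denied here because it has no allow rule), so its content is treated as ordinary text.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $html = '<p>hi</p><script>alert(1)</script>';

# Default: "script" is off, so the whole element is ignored.
my $ignoring   = HTML::Scrubber->new( allow => [qw( p )] );
my $suppressed = $ignoring->scrub($html);

# "script" on: the tag goes through the normal allow/deny rules.
my $validating = HTML::Scrubber->new( allow => [qw( p )] );
$validating->script(1);
my $validated = $validating->scrub($html);

print "script off: $suppressed\n";
print "script on:  $validated\n";
```

Turning "script" on is therefore rarely what you want unless a rule for the "script" tag is also in place.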

allow
    $p->allow(qw[ t a g s ]);    # allow the listed tags

deny
    $p->deny(qw[ t a g s ]);     # deny the listed tags
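
A small sketch showing "allow" and "deny" applied in sequence to the same object; the tags and the input are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $p    = HTML::Scrubber->new;
my $html = '<b>bold</b> <i>italic</i>';

$p->allow(qw( b i ));
my $first = $p->scrub($html);     # <b>bold</b> <i>italic</i>

# A later deny replaces the earlier allow rule for the same tag.
$p->deny(qw( i ));
my $second = $p->scrub($html);    # <b>bold</b> italic

print "$first\n$second\n";
```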

rules
    $p->rules(
        img => {
            src => qr{^(?!http://)}i,    # only relative image links allowed
            alt => 1,                    # alt attribute allowed
            '*' => 0,                    # deny all other attributes
        },
        a => {
            href => sub { ... },         # check or adjust with a callback
        },
        b => 1,
        ...
    );

Updates a set of attribute rules. Each rule can be 1/0, a regular
expression or a callback. Values longer than 1 char are treated as
regexps. The callback is called with the following arguments: the
current object, tag name, attribute name, and attribute value; the
callback should return an empty list to drop the attribute, "undef" to
keep it without a value, or a new scalar value.
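
As a sketch of the callback form described above, the rule below keeps an "href" only when it looks like an http or https URL. The URL test and the input string are assumptions made for illustration, not something the module prescribes.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $p = HTML::Scrubber->new;
$p->rules(
    a => {
        href => sub {
            # Arguments: current object, tag name, attribute name, value.
            my ( $self, $tag, $attr, $value ) = @_;
            return $value if $value =~ m{^https?://}i;    # keep the value
            return;                                       # empty list: drop it
        },
        '*' => 0,    # deny all other attributes
    },
);

my $out = $p->scrub(
    '<a href="javascript:alert(1)" title="t">x</a> <a href="https://example.org/">y</a>'
);
print $out, "\n";
```

Returning a modified scalar instead of `$value` would rewrite the attribute rather than merely filter it.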

default
    print "default is ", $p->default();
    $p->default(1);      # allow tags by default
    $p->default(
        undef,           # don't change
        {                # default attribute rules
            '*' => 1,    # allow attributes by default
        }
    );
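
A brief sketch of setting both parts of the default in one call: allow every tag, and allow every attribute except "onclick". The attribute choice and the input are illustrative assumptions.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $p = HTML::Scrubber->new;

# First argument: default tag rule. Second: default attribute rules.
$p->default( 1, { '*' => 1, onclick => 0 } );

my $current = $p->default();    # no arguments: returns the tag default
my $out     = $p->scrub('<span class="c" onclick="x()">t</span>');

print "default is $current\n";
print $out, "\n";
```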

scrub_file
    $html = $scrubber->scrub_file('foo.html');    ## returns giant string
    die "Eeek $!" unless defined $html;           ## opening foo.html may have failed
    $scrubber->scrub_file('foo.html', 'new.html') or die "Eeek $!";
    $scrubber->scrub_file('foo.html', *STDOUT)
        or die "Eeek $!"
        if fileno STDOUT;

scrub
    print $scrubber->scrub($html);                ## returns giant string
    $scrubber->scrub($html, 'new.html') or die "Eeek $!";
    $scrubber->scrub($html, *STDOUT)
        or die "Eeek $!"
        if fileno STDOUT;
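
A self-contained sketch of the one-argument form of "scrub_file". The temporary file (created with the core File::Temp module) and its content are assumptions made so the example can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use HTML::Scrubber;

# Write a small HTML file to scrub (hypothetical content).
my ( $fh, $path ) = tempfile( UNLINK => 1 );
print {$fh} '<b>ok</b> <blink>text</blink>';
close $fh or die "close failed: $!";

my $scrubber = HTML::Scrubber->new( allow => [qw( b )] );

# With only a file name, the scrubbed document is returned as a string.
my $clean = $scrubber->scrub_file($path);
die "Eeek $!" unless defined $clean;

print $clean, "\n";
```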

_scrub_str
Default handler, used by both "_scrub" and "_scrub_fh". All the common
code (basically all of it) was moved into this single routine for ease
of maintenance.

_scrub_fh
Default handler, does the scrubbing if we're scrubbing out to a file.
Now calls "_scrub_str" and pushes that out to a file.

_scrub
Default handler, does the scrubbing if we're returning a giant string.
Now calls "_scrub_str" and appends that to the output string.

How does it work?
When a tag is encountered, HTML::Scrubber allows/denies the tag using
the explicit rule if one exists.

If no explicit rule exists, Scrubber applies the default rule.

If an explicit rule exists, but it's a simple rule(1), then the default
attribute rule is applied.
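
The three cases above can be sketched with one scrubber. The tags, attributes and input below are made up for illustration: "b" has a simple rule, "img" has an explicit attribute rule, and "i" has no rule at all.

```perl
use strict;
use warnings;
use HTML::Scrubber;

my $p = HTML::Scrubber->new(
    # Default rule: deny tags. Default attribute rule: only "title".
    default => [ 0, { '*' => 0, title => 1 } ],
    rules   => [
        b   => 1,                         # simple rule: default attribute rule applies
        img => { alt => 1, '*' => 0 },    # explicit attribute rule
    ],
);

my $out = $p->scrub('<b title="t" id="x">b</b><img alt="a" title="t"><i>i</i>');
print $out, "\n";
```

Here "b" keeps "title" because of the default attribute rule, "img" keeps only "alt" because its own rule takes precedence, and the "i" tags fall through to the default rule and are removed.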

EXAMPLE
    #!/usr/bin/perl -w
    use HTML::Scrubber;
    use strict;

    my @allow = qw[ br hr b a ];

    my @rules = (
        script => 0,
        img    => {
            src => qr{^(?!http://)}i,    # only relative image links allowed
            alt => 1,                    # alt attribute allowed
            '*' => 0,                    # deny all other attributes
        },
    );

    my @default = (
        0 =>                             # default rule, deny all tags
        {
            '*'    => 1,                 # default rule, allow all attributes
            'href' => qr{^(?:http|https|ftp)://}i,
            'src'  => qr{^(?:http|https|ftp)://}i,

            # If your perl doesn't have qr
            # just use a string with length greater than 1
            'cite'        => '(?i-xsm:^(?:http|https|ftp):)',
            'language'    => 0,
            'name'        => 1,          # could be sneaky, but hey ;)
            'onblur'      => 0,
            'onchange'    => 0,
            'onclick'     => 0,
            'ondblclick'  => 0,
            'onerror'     => 0,
            'onfocus'     => 0,
            'onkeydown'   => 0,
            'onkeypress'  => 0,
            'onkeyup'     => 0,
            'onload'      => 0,
            'onmousedown' => 0,
            'onmousemove' => 0,
            'onmouseout'  => 0,
            'onmouseover' => 0,
            'onmouseup'   => 0,
            'onreset'     => 0,
            'onselect'    => 0,
            'onsubmit'    => 0,
            'onunload'    => 0,
            'src'         => 0,          # duplicate key: this overrides the 'src' regexp above
            'type'        => 0,
        }
    );

    my $scrubber = HTML::Scrubber->new();
    $scrubber->allow(@allow);
    $scrubber->rules(@rules);        # key/value pairs
    $scrubber->default(@default);
    $scrubber->comment(1);           # 1 allow, 0 deny

    ## preferred way to create the same object
    $scrubber = HTML::Scrubber->new(
        allow   => \@allow,
        rules   => \@rules,
        default => \@default,
        comment => 1,
        process => 0,
    );

    require Data::Dumper, die Data::Dumper::Dumper($scrubber) if @ARGV;

    my $it = q[
        <?php echo(" EVIL EVIL EVIL "); ?> <!-- asdf -->
        <hr>
        <I FAKE="attribute" > IN ITALICS WITH FAKE="attribute" </I><br>
        <B> IN BOLD </B><br>
        <A NAME="evil">
            <A HREF="javascript:alert('die die die');">HREF=JAVA <!></A>
            <br>
            <A HREF="image/bigone.jpg" ONMOUSEOVER="alert('die die die');">
                <IMG SRC="image/smallone.jpg" ALT="ONMOUSEOVER JAVASCRIPT">
            </A>
        </A> <br>
    ];

    print "#original text", $/, $it, $/;
    print
        "#scrubbed text (default ", $scrubber->default(),    # no arguments returns the current value
        " comment ", $scrubber->comment(), " process ", $scrubber->process(), " )", $/, $scrubber->scrub($it), $/;

    $scrubber->default(1);    # allow all tags by default
    $scrubber->comment(0);    # deny comments

    print
        "#scrubbed text (default ",
        $scrubber->default(),
        " comment ",
        $scrubber->comment(),
        " process ",
        $scrubber->process(),
        " )", $/,
        $scrubber->scrub($it),
        $/;

    $scrubber->process(1);           # allow process instructions (dangerous)
    $default[0] = 1;                 # allow all tags by default
    $default[1]->{'*'} = 0;          # deny all attributes by default
    $scrubber->default(@default);    # set the default again

    print
        "#scrubbed text (default ",
        $scrubber->default(),
        " comment ",
        $scrubber->comment(),
        " process ",
        $scrubber->process(),
        " )", $/,
        $scrubber->scrub($it),
        $/;

FUN
If you have Test::Inline (and you've installed HTML::Scrubber), try

    pod2test Scrubber.pm >scrubber.t
    perl scrubber.t

SEE ALSO
HTML::Parser, Test::Inline.

The HTML::Sanitizer module is no longer available on CPAN.

VERSION REQUIREMENTS
As of version 0.14 I have added a perl minimum version requirement of
5.8. This is basically due to failures on the smokers' perl 5.6
installations, which appear to be down to installation mechanisms and
requirements.

Since I don't want to spend the time supporting a version that is so
old (and may not work for reasons of UTF support etc), I have added a
"use 5.008;" to the main module.

If this is problematic I am very willing to accept patches to fix this
up, although I do not personally see a good reason to support a release
that has been obsolete for 13 years.

CONTRIBUTING
If you want to contribute to the development of this module, the code
is on GitHub <http://github.com/nigelm/html-scrubber>. You'll need a
perl environment with Dist::Zilla, and if you're just getting started,
there's some documentation on using Vagrant and Perlbrew here
<http://mrcaron.github.io/2015/03/06/Perl-CPAN-Pull-Request.html>.

There is now a ".perltidyrc" and a ".tidyallrc" file within the
repository for the standard perltidy settings used - I will apply these
before new releases. Please do not let formatting prevent you from
sending in patches etc - this can be sorted out as part of the release
process. Info on "tidyall" can be found at
<https://metacpan.org/pod/distribution/Code-TidyAll/bin/tidyall>.

AUTHORS
·   Ruslan Zakirov <Ruslan.Zakirov@gmail.com>

·   Nigel Metheringham <nigelm@cpan.org>

·   D. H. <podmaster@cpan.org>

COPYRIGHT AND LICENSE
This software is copyright (c) 2018 by Ruslan Zakirov, Nigel
Metheringham, 2003-2004 D. H.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

SUPPORT
Perldoc
You can find documentation for this module with the perldoc command.

    perldoc HTML::Scrubber

Websites
The following websites have more information about this module, and may
be of help to you. As always, in addition to those websites please use
your favorite search engine to discover more resources.

·   MetaCPAN

    A modern, open-source CPAN search engine, useful to view POD in
    HTML format.

    <https://metacpan.org/release/HTML-Scrubber>

·   Search CPAN

    The default CPAN search engine, useful to view POD in HTML format.

    <http://search.cpan.org/dist/HTML-Scrubber>

·   RT: CPAN's Bug Tracker

    The RT (Request Tracker) website is the default bug/issue
    tracking system for CPAN.

    <https://rt.cpan.org/Public/Dist/Display.html?Name=HTML-Scrubber>

·   AnnoCPAN

    AnnoCPAN is a website that allows community annotations of Perl
    module documentation.

    <http://annocpan.org/dist/HTML-Scrubber>

·   CPAN Ratings

    CPAN Ratings is a website that allows community ratings and
    reviews of Perl modules.

    <http://cpanratings.perl.org/d/HTML-Scrubber>

·   CPANTS

    CPANTS is a website that analyzes the Kwalitee (code metrics)
    of a distribution.

    <http://cpants.cpanauthors.org/dist/HTML-Scrubber>

·   CPAN Testers

    CPAN Testers is a network of smoke testers who run automated
    tests on uploaded CPAN distributions.

    <http://www.cpantesters.org/distro/H/HTML-Scrubber>

·   CPAN Testers Matrix

    The CPAN Testers Matrix is a website that provides a visual
    overview of the test results for a distribution on various
    Perls/platforms.

    <http://matrix.cpantesters.org/?dist=HTML-Scrubber>

·   CPAN Testers Dependencies

    CPAN Testers Dependencies is a website that shows a chart of
    the test results of all dependencies for a distribution.

    <http://deps.cpantesters.org/?module=HTML::Scrubber>

Bugs / Feature Requests
Please report any bugs or feature requests by email to
"bug-html-scrubber at rt.cpan.org", or through the web interface at
<https://rt.cpan.org/Public/Bug/Report.html?Queue=HTML-Scrubber>. You
will be automatically notified of any progress on the request by the
system.

Source Code
The code is open to the world, and available for you to hack on. Please
feel free to browse it and play with it, or whatever. If you want to
contribute patches, please send me a diff or prod me to pull from your
repository :)

<https://github.com/nigelm/html-scrubber>

    git clone https://github.com/nigelm/html-scrubber.git

CONTRIBUTORS
·   Andrei Vereha <avereha@gmail.com>

·   Lee Johnson <lee@givengain.ch>

·   Michael Caron <michael.r.caron@gmail.com>

·   Michael Caron <mrcaron@users.noreply.github.com>

·   Nigel Metheringham <nm9762github@muesli.org.uk>

·   Paul Cochrane <paul@liekut.de>

·   Ruslan Zakirov <ruz@bestpractical.com>

·   Sergey Romanov <complefor@rambler.ru>

·   vagrant <vagrant@precise64.(none)>

perl v5.32.0                      2020-07-28                 HTML::Scrubber(3)