HTML::Defang(3) User Contributed Perl Documentation HTML::Defang(3)

NAME
HTML::Defang - Cleans HTML as well as CSS of scripting and other
executable contents, and neutralises XSS attacks.

SYNOPSIS
    use HTML::Defang;

    my $InputHtml = "<html><body></body></html>";

    my $Defang = HTML::Defang->new(
        context             => $Self,
        fix_mismatched_tags => 1,
        tags_to_callback    => [ qw(br embed img) ],
        tags_callback       => \&DefangTagsCallback,
        url_callback        => \&DefangUrlCallback,
        css_callback        => \&DefangCssCallback,
        attribs_to_callback => [ qw(border src) ],
        attribs_callback    => \&DefangAttribsCallback
    );

    my $SanitizedHtml = $Defang->defang($InputHtml);

    # Callback for custom handling of specific HTML tags
    sub DefangTagsCallback {
        my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

        # Explicitly defang this tag, even though it is safe
        return DEFANG_ALWAYS if $lcTag eq 'br';

        # Explicitly whitelist this tag, even though it is unsafe
        return DEFANG_NONE if $lcTag eq 'embed';

        # I am not sure what to do with this tag, so process as HTML::Defang normally would
        return DEFANG_DEFAULT if $lcTag eq 'img';
    }

    # Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
    sub DefangUrlCallback {
        my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

        # Explicitly allow this URL in tag attributes or stylesheets
        return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;

        # Explicitly defang this URL in tag attributes or stylesheets
        return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;
    }

    # Callback for custom handling of style tags/attributes
    sub DefangCssCallback {
        my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
        my $i = 0;
        foreach (@$Selectors) {
            my $SelectorRule = $$SelectorRules[$i];
            foreach my $KeyValueRules (@$SelectorRule) {
                foreach my $KeyValueRule (@$KeyValueRules) {
                    my ($Key, $Value) = @$KeyValueRule;

                    # Comment out any '!important' directive
                    $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ /!important/;

                    # Comment out any 'position: fixed' declaration
                    $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ /position/ && $Value =~ /fixed/;
                }
            }
            $i++;
        }
    }

    # Callback for custom handling of HTML tag attributes
    sub DefangAttribsCallback {
        my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

        # Change all 'border' attribute values to zero.
        $$AttrValR = '0' if $lcAttrKey eq 'border';

        # Defang all 'src' attributes
        return DEFANG_ALWAYS if $lcAttrKey eq 'src';

        return DEFANG_NONE;
    }
83
DESCRIPTION
This module accepts an input HTML and/or CSS string and removes any
executable code including scripting, embedded objects, applets, etc.,
and neutralises any XSS attacks. A whitelist-based approach is used,
which means only HTML known to be safe is allowed through.

HTML::Defang uses a custom HTML tag parser. The parser has been
designed and tested to work with nasty real-world HTML and to
emulate as closely as possible what browsers actually do with
strange-looking constructs. The test suite has been built from examples
drawn from a range of sources such as http://ha.ckers.org/xss.html and
http://imfo.ru/csstest/css_hacks/import.php to ensure that as many
XSS attack scenarios as possible have been dealt with.
97
98 HTML::Defang can make callbacks to client code when it encounters the
99 following:
100
101 · When a specified tag is parsed
102
103 · When a specified attribute is parsed
104
105 · When a URL is parsed as part of an HTML attribute, or CSS property
106 value.
107
108 · When style data is parsed, as part of an HTML style attribute, or
109 as part of an HTML <style> tag.
110
The callbacks include details about the current tag/attribute that is
being parsed, and also give a scalar reference to the input HTML.
Querying pos() on the input HTML should indicate how far the module
has got with parsing. This gives the client code flexibility in working
with HTML::Defang.

HTML::Defang can defang whole tags, any attribute in a tag, any URL
that appears as an attribute or style property, or any CSS declaration
in a declaration block in a style rule. This helps to block precisely
the most specific unwanted elements in the content (for example, block
just an offending attribute instead of the whole tag), while retaining
any safe HTML/CSS.
123
125 HTML::Defang->new(%Options)
126 Constructs a new HTML::Defang object. The following options are
127 supported:
128
129 Options
tags_to_callback
    Array reference of tags for which a callback should be
    made. If a tag in this array is parsed, the subroutine
    tags_callback() is invoked.

attribs_to_callback
    Array reference of tag attributes for which a callback
    should be made. If an attribute in this array is parsed,
    the subroutine attribs_callback() is invoked.
139
140 tags_callback
141 Subroutine reference to be invoked when a tag listed in
142 @$tags_to_callback is parsed.
143
144 attribs_callback
145 Subroutine reference to be invoked when an attribute listed
146 in @$attribs_to_callback is parsed.
147
148 url_callback
149 Subroutine reference to be invoked when a URL is detected
150 in an HTML tag attribute or a CSS property.
151
152 css_callback
153 Subroutine reference to be invoked when CSS data is found
154 either as the contents of a 'style' attribute in an HTML
155 tag, or as the contents of a <style> HTML tag.
156
157 fix_mismatched_tags
158 This property, if set, fixes mismatched tags in the HTML
159 input. By default, tags present in the default
160 %mismatched_tags_to_fix hash are fixed. This set of tags
161 can be overridden by passing in an array reference
162 $mismatched_tags_to_fix to the constructor. Any opened tags
163 in the set are automatically closed if no corresponding
164 closing tag is found. If an unbalanced closing tag is
165 found, that is commented out.
166
167 mismatched_tags_to_fix
168 Array reference of tags for which the code would check for
169 matching opening and closing tags. See the property
170 $fix_mismatched_tags.
171
context
    You can pass an arbitrary scalar as a 'context' value
    that's then passed as the first parameter to all callback
    functions. Most commonly this is something like '$Self'.
176
177 allow_double_defang
178 If this is true, then tag names and attribute names which
179 already begin with the defang string ("defang_" by default)
180 will have an additional copy of the defang string prepended
181 if they are flagged to be defanged by the return value of a
182 callback, or if the tag or attribute name is unknown.
183
184 The default is to assume that tag names and attribute names
185 beginning with the defang string are already made safe, and
186 need no further modification, even if they are flagged to
187 be defanged by the return value of a callback. Any tag or
188 attribute modifications made directly by a callback are
189 still performed.
190
191 Debug
192 If set, prints debugging output.
193
195 COMMON PARAMETERS
A number of the callbacks share the same parameters. These common
parameters are documented here. Certain variables may have specific
meanings in certain callbacks, so be sure to check the
documentation for that method first before referring to this section.

$context
    You can pass an arbitrary scalar as a 'context' value that's
    then passed as the first parameter to all callback functions.
    Most commonly this is something like '$Self'.
205
206 $Defang
207 Current HTML::Defang instance
208
209 $OpenAngle
210 Opening angle(<) sign of the current tag.
211
212 $lcTag
213 Lower case version of the HTML tag that is currently being
214 parsed.
215
216 $IsEndTag
217 Has the value '/' if the current tag is a closing tag.
218
$AttributeHash
    A reference to a hash containing the attributes of the current
    tag and their values. Each value is a scalar reference to the
    value, rather than just a scalar value. You can add attributes
    (remember to make it a scalar ref, e.g.
    $AttributeHash->{"newattr"} = \"newval"), delete attributes, or
    modify attribute values in this hash, and any changes you make
    will be incorporated into the output HTML stream.

    The attribute values will have any entity references decoded
    before being passed to you, and any unsafe values will be
    re-encoded back into the HTML stream.

    So for instance, the tag:

        <div title="&lt;&quot;Hi there &lt;">

    Will have the attribute hash:

        { title => \q[<"Hi there <] }

    And will be turned back into the HTML on output:

        <div title="&lt;&quot;Hi there &lt;">
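
The hash manipulation described above can be sketched in plain Perl. This is only an illustration: the hash below is a hand-built stand-in for the $AttributeHash a callback would receive, and the attribute names and values are made up for the example.

```perl
use strict;
use warnings;

# Hand-built stand-in for the hash a callback receives: every value is a
# reference to a scalar, not the scalar itself.
my ($Title, $Border) = (q[<"Hi there <], '2');
my $AttributeHash = { title => \$Title, border => \$Border };

# Add an attribute: the new value must also be a scalar reference.
$AttributeHash->{newattr} = \"newval";

# Modify an existing value in place by dereferencing it.
${ $AttributeHash->{border} } = '0';

# Delete an attribute.
delete $AttributeHash->{title};

my $Summary = join ', ',
    map { "$_=" . ${ $AttributeHash->{$_} } } sort keys %$AttributeHash;
print "$Summary\n";
```

Assigning a plain string instead of a reference (for example $AttributeHash->{newattr} = "newval") is the mistake the parenthetical note above warns about.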
243
244 $CloseAngle
245 Anything after the end of last attribute including the closing
246 HTML angle(>)
247
$HtmlR
    A scalar reference to the input HTML. The input HTML is parsed
    using m/\G$SomeRegex/c constructs, so to continue from where
    HTML::Defang left off, clients can use m/\G$SomeRegex/c for
    further processing on the input. One can also use the pos()
    function to determine where HTML::Defang left off. This,
    combined with the add_to_output() method, should give
    reasonable flexibility for the client to process the input.
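
As a rough illustration of resuming from a saved match position, the sketch below uses only core Perl. The first match merely stands in for the parser having consumed a tag; it is not HTML::Defang code, and the sample HTML is invented. The /g modifier is added alongside /c here on the assumption that the client wants pos() advanced after each successful match.

```perl
use strict;
use warnings;

my $Html  = '<b>bold</b> trailing text';
my $HtmlR = \$Html;

# Stand-in for the parser: consume the opening <b> tag, which leaves
# pos($$HtmlR) just past it.
$$HtmlR =~ m/\G<b>/gc or die "expected <b>";
my $AfterTag = pos($$HtmlR);

# Client code resumes at that offset and consumes text up to the next tag.
my $Text = '';
if ($$HtmlR =~ m/\G([^<]+)/gc) {
    $Text = $1;
}

# Because of /c, a failed match leaves pos() where it was.
my $Matched   = ($$HtmlR =~ m/\G<i>/gc) ? 1 : 0;
my $AfterText = pos($$HtmlR);

print "tag ended at $AfterTag, text '$Text', position now $AfterText\n";
```

Without /c, the failed third match would reset pos() to undef and the position handed over by the parser would be lost.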
257
258 $OutR
259 A scalar reference to the processed output HTML so far.
260
261 tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
262 $AttributeHash, $CloseAngle, $HtmlR, $OutR)
If $Defang->{tags_callback} exists, and HTML::Defang has parsed a
tag present in $Defang->{tags_to_callback}, the above callback is
made to the client code. The return value of this method determines
whether the tag is defanged or not. More details below.
267
268 Return values
269 DEFANG_NONE
270 The current tag will not be defanged.
271
272 DEFANG_ALWAYS
273 The current tag will be defanged.
274
DEFANG_DEFAULT
    The current tag will be processed normally by HTML::Defang
    as if there was no callback method specified.
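
A tag callback using these three return values might look like the sketch below. It is written so that it runs on its own: the constant values are placeholders defined locally (real client code would use the constants supplied by HTML::Defang), MyTagsCallback and the rel="nofollow" policy are hypothetical, and the callback is exercised directly with mock arguments rather than through the parser.

```perl
use strict;
use warnings;

# Placeholder values so the sketch runs standalone; real client code would
# use the constants provided by HTML::Defang instead.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical policy: always defang <embed>; add rel="nofollow" to opening
# <a> tags and leave the final decision on them to the module.
sub MyTagsCallback {
    my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
        $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

    return DEFANG_ALWAYS if $lcTag eq 'embed';

    if ($lcTag eq 'a' && !$IsEndTag) {
        $AttributeHash->{rel} = \"nofollow";    # values are scalar refs
    }
    return DEFANG_DEFAULT;
}

# Exercise the callback directly with mock arguments.
my $Html  = '<a href="http://example.com/">link</a>';
my $Out   = '';
my $Href  = 'http://example.com/';
my %Attrs = (href => \$Href);

my $EmbedResult = MyTagsCallback(undef, undef, '<', 'embed', '', {}, '>', \$Html, \$Out);
my $LinkResult  = MyTagsCallback(undef, undef, '<', 'a', '', \%Attrs, '>', \$Html, \$Out);

print "embed: $EmbedResult, a: $LinkResult, rel: ${ $Attrs{rel} }\n";
```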
278
279 attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
280 $HtmlR, $OutR)
281 If $Defang->{attribs_callback} exists, and HTML::Defang has parsed
282 an attribute present in $Defang->{attribs_to_callback}, the above
283 callback is made to the client code. The return value of this
284 method determines whether the attribute is defanged or not. More
285 details below.
286
287 Method parameters
288 $lcAttrKey
289 Lower case version of the HTML attribute that is currently
290 being parsed.
291
292 $AttrVal
293 Reference to the HTML attribute value that is currently
294 being parsed.
295
296 See $AttributeHash for details of decoding.
297
298 Return values
299 DEFANG_NONE
300 The current attribute will not be defanged.
301
302 DEFANG_ALWAYS
303 The current attribute will be defanged.
304
DEFANG_DEFAULT
    The current attribute will be processed normally by
    HTML::Defang as if there was no callback method specified.
308
309 url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
310 $AttributeHash, $HtmlR, $OutR)
If $Defang->{url_callback} exists, and HTML::Defang has parsed a
URL, the above callback is made to the client code. The return
value of this method determines whether the attribute containing
the URL is defanged or not. URL callbacks can be made from <style>
tags as well as style attributes, in which case the particular style
declaration will be commented out. More details below.
317
318 Method parameters
$lcAttrKey
    Lower case version of the HTML attribute that is currently
    being parsed. However, if this callback is made as a result
    of parsing a URL in a style attribute, $lcAttrKey will be
    set to the string 'style', and it will be set to undef if
    this callback is made as a result of parsing a URL inside a
    style tag.
326
327 $AttrVal
328 Reference to the URL value that is currently being parsed.
329
$AttributeHash
    A reference to a hash containing the attributes of the
    current tag and their values. Each value is a scalar
    reference to the value, rather than just a scalar value.
    You can add attributes (remember to make it a scalar ref,
    e.g. $AttributeHash->{"newattr"} = \"newval"), delete
    attributes, or modify attribute values in this hash, and
    any changes you make will be incorporated into the output
    HTML stream. Will be set to undef if the callback is made
    due to a URL in a <style> tag or attribute.
340
341 Return values
342 DEFANG_NONE
343 The current URL will not be defanged.
344
345 DEFANG_ALWAYS
346 The current URL will be defanged.
347
DEFANG_DEFAULT
    The current URL will be processed normally by HTML::Defang
    as if there was no callback method specified.
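
One way a URL callback could map these return values onto a host allow-list is sketched below. The constants are local placeholders so the example runs standalone, and MyUrlCallback, %AllowedHosts and the sample URLs are all hypothetical. The host extraction is deliberately simplistic (it ignores user-info and other URL forms) and is not how the module itself inspects URLs.

```perl
use strict;
use warnings;

# Placeholder values so the sketch runs standalone; real client code would
# use the constants provided by HTML::Defang instead.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical allow-list of trusted hosts.
my %AllowedHosts = map { $_ => 1 } qw(safesite.com www.safesite.com);

sub MyUrlCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR,
        $AttributeHash, $HtmlR, $OutR) = @_;

    # $AttrValR is a reference to the URL. Note that $lcAttrKey may be
    # undef (URL inside a <style> tag), so it is not used here.
    if ($$AttrValR =~ m{^https?://([^/:?#]+)}i) {
        return $AllowedHosts{ lc $1 } ? DEFANG_NONE : DEFANG_ALWAYS;
    }

    # Relative or non-http URLs: let the module apply its normal rules.
    return DEFANG_DEFAULT;
}

# Exercise the callback directly with mock arguments.
my @Results = map {
    my $Url = $_;
    MyUrlCallback(undef, undef, 'img', 'src', \$Url, {}, undef, undef);
} ('http://safesite.com/a.png', 'https://evilsite.com/x.js', 'images/local.png');

print join(',', @Results), "\n";
```

Matching on the extracted host rather than on a substring of the whole URL avoids accepting something like http://evilsite.com/?safesite.com by accident.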
351
352 css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag,
353 $IsAttr, $OutR)
If $Defang->{css_callback} exists, and HTML::Defang has parsed a
<style> tag or style attribute, the above callback is made to the
client code. The return value of this method determines whether a
particular declaration in the style rules is defanged or not. More
details below.
359
360 Method parameters
361 $Selectors
362 Reference to an array containing the selectors in a style
363 tag or attribute.
364
365 $SelectorRules
366 Reference to an array containing the style declaration
367 blocks of all selectors in a style tag or attribute.
368 Consider the below CSS:
369
370 a { b:c; d:e}
371 j { k:l; m:n}
372
373 The declaration blocks will get parsed into the following
374 data structure:
375
376 [
377 [
378 [ "b", "c", DEFANG_DEFAULT ],
379 [ "d", "e", DEFANG_DEFAULT ]
380 ],
381 [
382 [ "k", "l", DEFANG_DEFAULT ],
383 [ "m", "n", DEFANG_DEFAULT ]
384 ]
385 ]
386
So, generally each property:value pair in a declaration is
parsed into an array of the form

    ["property", "value", X]

where X can be DEFANG_NONE, DEFANG_ALWAYS or
DEFANG_DEFAULT, with DEFANG_DEFAULT being the default value.
A client can manipulate this value to instruct HTML::Defang
to defang this property:value pair.

DEFANG_NONE - Do not defang

DEFANG_ALWAYS - Defang the property:value pair

DEFANG_DEFAULT - Process this as if there is no callback
specified
403
404 $IsAttr
405 True if the currently processed item is a style attribute.
406 False if the currently processed item is a style tag.
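
Walking this structure could look like the following sketch. It assumes the layout exactly as documented above (one array of [property, value, flag] triples per selector); the constants are local placeholders so the code runs standalone, and the sample CSS and the flagging policy are invented for the example.

```perl
use strict;
use warnings;

# Placeholder values so the sketch runs standalone; real client code would
# use the constants provided by HTML::Defang instead.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hand-built equivalent of what a callback might receive for:
#   a   { color:red !important; margin:0 }
#   div { position:fixed; top:0 }
my $Selectors     = [ 'a', 'div' ];
my $SelectorRules = [
    [ [ 'color',    'red !important', DEFANG_DEFAULT ],
      [ 'margin',   '0',              DEFANG_DEFAULT ] ],
    [ [ 'position', 'fixed',          DEFANG_DEFAULT ],
      [ 'top',      '0',              DEFANG_DEFAULT ] ],
];

# Flag every declaration that uses !important or position:fixed.
for my $i (0 .. $#$Selectors) {
    for my $Declaration (@{ $SelectorRules->[$i] }) {
        my ($Property, $Value) = @$Declaration;
        $Declaration->[2] = DEFANG_ALWAYS
            if $Value =~ /!important/
            || ($Property eq 'position' && $Value eq 'fixed');
    }
}

# Collect the declarations that were flagged.
my @Flagged = map  { "$_->[0]:$_->[1]" }
              grep { $_->[2] == DEFANG_ALWAYS }
              map  { @$_ } @$SelectorRules;

print join('; ', @Flagged), "\n";
```

Because the triples are modified in place through the references, nothing needs to be returned for the flags to take effect in this sketch.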
407
409 PUBLIC METHODS
defang($InputHtml)
    Cleans up $InputHtml of any executable code including
    scripting, embedded objects, applets, etc., and defangs any XSS
    attacks.
414
415 Method parameters
416 $InputHtml
417 The input HTML string that needs to be sanitized.
418
419 Returns the cleaned HTML. If fix_mismatched_tags is set, any
420 tags that appear in @$mismatched_tags_to_fix that are
421 unbalanced are automatically commented or closed.
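
A minimal call might look like the sketch below. It assumes that HTML::Defang is installed and that the constructor can be used without any callbacks; the input string is made up, and the module is loaded inside eval so the script still runs (and simply reports the fact) when it is not available. No particular output format is implied.

```perl
use strict;
use warnings;

my $InputHtml = '<p>Hello <script>alert(1)</script><b>world</p>';
my $Output;

# Load the module at run time so the sketch degrades gracefully if it is
# not installed.
if (eval { require HTML::Defang; 1 }) {
    my $Defang = HTML::Defang->new(fix_mismatched_tags => 1);
    $Output = $Defang->defang($InputHtml);
    print "$Output\n";
}
else {
    print "HTML::Defang is not installed; nothing to show\n";
}
```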
422
423 add_to_output($String)
424 Appends $String to the output after the current parsed tag
425 ends. Can be used by client code in callback methods to add
426 HTML text to the processed output. If the HTML text needs to be
427 defanged, client code can safely call HTML::Defang->defang()
428 recursively from within the callback.
429
430 Method parameters
431 $String
432 The string that is added after the current parsed tag
433 ends.
434
435 INTERNAL METHODS
436 Generally these methods never need to be called by users of the
437 class, because they'll be called internally as the appropriate tags
438 are encountered, but they may be useful for some users in some
439 cases.
440
441 defang_script($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
442 $TagTrail, $Attributes, $CloseAngle)
443 This method is invoked when a <script> tag is parsed. Defangs
444 the <script> opening tag, and any closing tag. Any scripting
445 content is also commented out, so browsers don't display them.
446
447 Returns 1 to indicate that the <script> tag must be defanged.
448
449 Method parameters
450 $OutR
451 A reference to the processed output HTML before the tag
452 that is currently being parsed.
453
454 $HtmlR
455 A scalar reference to the input HTML.
456
457 $TagOps
458 Indicates what operation should be done on a tag. Can
459 be undefined, integer or code reference. Undefined
460 indicates an unknown tag to HTML::Defang, 1 indicates a
461 known safe tag, 0 indicates a known unsafe tag, and a
462 code reference indicates a subroutine that should be
463 called to parse the current tag. For example, <style>
464 and <script> tags are parsed by dedicated subroutines.
465
466 $OpenAngle
467 Opening angle(<) sign of the current tag.
468
469 $IsEndTag
470 Has the value '/' if the current tag is a closing tag.
471
472 $Tag
473 The HTML tag that is currently being parsed.
474
475 $TagTrail
476 Any space after the tag, but before attributes.
477
$Attributes
    A reference to an array of the attributes and their
    values, including any surrounding spaces. Each element
    of the array is added by 'push' calls like below.

    push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];
484
485 $CloseAngle
486 Anything after the end of last attribute including the
487 closing HTML angle(>)
488
489 defang_style($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
490 $TagTrail, $Attributes, $CloseAngle, $IsAttr)
491 Builds a list of selectors and declarations from HTML style
492 tags as well as style attributes in HTML tags and calls
493 defang_stylerule() to do the actual defanging.
494
495 Returns 0 to indicate that style tags must not be defanged.
496
497 Method parameters
498 $IsAttr
499 Whether we are currently parsing a style attribute or
500 style tag. $IsAttr will be true if we are currently
501 parsing a style attribute.
502
503 For a description of other parameters, see documentation of
504 defang_script() method
505
506 cleanup_style($StyleString)
507 Helper function to clean up CSS data. This function directly
508 operates on the input string without taking a copy.
509
510 Method parameters
511 $StyleString
512 The input style string that is cleaned.
513
514 defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr,
515 $HtmlR, $OutR)
516 Defangs style data.
517
518 Method parameters
519 $SelectorsIn
520 An array reference to the selectors in the style
521 tag/attribute contents.
522
523 $StyleRules
524 An array reference to the declaration blocks in the
525 style tag/attribute contents.
526
527 $lcTag
528 Lower case version of the HTML tag that is currently
529 being parsed.
530
531 $IsAttr
532 Whether we are currently parsing a style attribute or
533 style tag. $IsAttr will be true if we are currently
534 parsing a style attribute.
535
536 $HtmlR
537 A scalar reference to the input HTML.
538
539 $OutR
540 A scalar reference to the processed output so far.
541
542 defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
543 $Tag, $TagTrail, $Attributes, $CloseAngle)
544 Defangs attributes, defangs tags, does tag, attrib, css and url
545 callbacks.
546
547 Method parameters
548 For a description of the method parameters, see
549 documentation of defang_script() method
550
cleanup_attribute($AttributeString)
    Helper function to clean up attributes.
553
554 Method parameters
555 $AttributeString
556 The value of the attribute.
557
SEE ALSO
<http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>,
HTML::StripScripts, HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

AUTHOR
Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
<cpan@robm.fastmail.fm> for initial code, guidance, support and bug
fixes.

COPYRIGHT AND LICENSE
Copyright (C) 2003-2010 by Opera Software Australia Pty Ltd

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.12.2 2011-01-03 HTML::Defang(3)