HTML::Defang(3)       User Contributed Perl Documentation      HTML::Defang(3)

NAME

       HTML::Defang - Cleans HTML as well as CSS of scripting and other
       executable contents, and neutralises XSS attacks.

SYNOPSIS

         my $InputHtml = "<html><body></body></html>";

         my $Defang = HTML::Defang->new(
           context => $Self,
           fix_mismatched_tags => 1,
           tags_to_callback => [ qw(br embed img) ],
           tags_callback => \&DefangTagsCallback,
           url_callback => \&DefangUrlCallback,
           css_callback => \&DefangCssCallback,
           attribs_to_callback => [ qw(border src) ],
           attribs_callback => \&DefangAttribsCallback
         );

         my $SanitizedHtml = $Defang->defang($InputHtml);

         # Callback for custom handling of specific HTML tags
         sub DefangTagsCallback {
           my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

           # Explicitly defang this tag, even though it is safe
           return DEFANG_ALWAYS if $lcTag eq 'br';

           # Explicitly whitelist this tag, even though it is unsafe
           return DEFANG_NONE if $lcTag eq 'embed';

           # Not sure what to do with this tag, so process it as HTML::Defang normally would
           return DEFANG_DEFAULT if $lcTag eq 'img';

           # Any other tag: fall back to normal processing
           return DEFANG_DEFAULT;
         }

         # Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
         sub DefangUrlCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

           # Explicitly allow this URL in tag attributes or stylesheets
           return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;

           # Explicitly defang this URL in tag attributes or stylesheets
           return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;

           # Any other URL: fall back to normal processing
           return DEFANG_DEFAULT;
         }

         # Callback for custom handling of style tags/attributes
         sub DefangCssCallback {
           my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
           my $i = 0;
           foreach (@$Selectors) {
             my $SelectorRule = $$SelectorRules[$i];
             foreach my $KeyValueRules (@$SelectorRule) {
               foreach my $KeyValueRule (@$KeyValueRules) {
                 my ($Key, $Value) = @$KeyValueRule;

                 # Comment out any '!important' directive
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ /!important/;

                 # Comment out any 'position: fixed;' declaration
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ /position/ && $Value =~ /fixed/;
               }
             }
             $i++;
           }
         }

         # Callback for custom handling of HTML tag attributes
         sub DefangAttribsCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

           # Change all 'border' attribute values to zero.
           $$AttrValR = '0' if $lcAttrKey eq 'border';

           # Defang all 'src' attributes
           return DEFANG_ALWAYS if $lcAttrKey eq 'src';

           return DEFANG_NONE;
         }

DESCRIPTION

       This module accepts an input HTML and/or CSS string and removes any
       executable code including scripting, embedded objects, applets, etc.,
       and neutralises any XSS attacks. A whitelist-based approach is used,
       which means only HTML known to be safe is allowed through.

       HTML::Defang uses a custom HTML tag parser. The parser has been
       designed and tested to work with nasty real-world HTML and to emulate
       as closely as possible what browsers actually do with strange-looking
       constructs. The test suite has been built from examples from a range
       of sources such as http://ha.ckers.org/xss.html and
       http://imfo.ru/csstest/css_hacks/import.php to ensure that as many
       XSS attack scenarios as possible have been dealt with.

       HTML::Defang can make callbacks to client code when it encounters the
       following:

       ·   When a specified tag is parsed

       ·   When a specified attribute is parsed

       ·   When a URL is parsed as part of an HTML attribute or CSS property
           value

       ·   When style data is parsed, as part of an HTML style attribute or
           as part of an HTML <style> tag

       The callbacks include details about the current tag/attribute that is
       being parsed, and also give a scalar reference to the input HTML.
       Querying pos() on the input HTML indicates how far the module has got
       with parsing. This gives the client code flexibility in working with
       HTML::Defang.

       HTML::Defang can defang whole tags, any attribute in a tag, any URL
       that appears as an attribute or style property, or any CSS declaration
       in a declaration block in a style rule. This helps to precisely block
       the most specific unwanted elements in the content (for example, block
       just an offending attribute instead of the whole tag), while retaining
       any safe HTML/CSS.

CONSTRUCTOR

       HTML::Defang->new(%Options)
           Constructs a new HTML::Defang object. The following options are
           supported:

           Options
               tags_to_callback
                   Array reference of tags for which a callback should be
                   made. If a tag in this array is parsed, the subroutine
                   tags_callback() is invoked.

               attribs_to_callback
                   Array reference of tag attributes for which a callback
                   should be made. If an attribute in this array is parsed,
                   the subroutine attribs_callback() is invoked.

               tags_callback
                   Subroutine reference to be invoked when a tag listed in
                   @$tags_to_callback is parsed.

               attribs_callback
                   Subroutine reference to be invoked when an attribute listed
                   in @$attribs_to_callback is parsed.

               url_callback
                   Subroutine reference to be invoked when a URL is detected
                   in an HTML tag attribute or a CSS property.

               css_callback
                   Subroutine reference to be invoked when CSS data is found
                   either as the contents of a 'style' attribute in an HTML
                   tag, or as the contents of a <style> HTML tag.

               fix_mismatched_tags
                   This option, if set, fixes mismatched tags in the HTML
                   input. By default, tags present in the default
                   %mismatched_tags_to_fix hash are fixed. This set of tags
                   can be overridden by passing an array reference as
                   mismatched_tags_to_fix to the constructor. Any opened tag
                   in the set is automatically closed if no corresponding
                   closing tag is found. If an unbalanced closing tag is
                   found, it is commented out.

               mismatched_tags_to_fix
                   Array reference of tags for which the code checks for
                   matching opening and closing tags. See the
                   fix_mismatched_tags option.

               context
                   You can pass an arbitrary scalar as a 'context' value
                   that's then passed as the first parameter to all callback
                   functions. Most commonly this is something like '$Self'.

               allow_double_defang
                   If this is true, then tag names and attribute names which
                   already begin with the defang string ("defang_" by default)
                   will have an additional copy of the defang string prepended
                   if they are flagged to be defanged by the return value of a
                   callback, or if the tag or attribute name is unknown.

                   The default is to assume that tag names and attribute names
                   beginning with the defang string are already made safe, and
                   need no further modification, even if they are flagged to
                   be defanged by the return value of a callback.  Any tag or
                   attribute modifications made directly by a callback are
                   still performed.

               Debug
                   If set, prints debugging output.
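
       The effect described for fix_mismatched_tags can be pictured with a
       small stack-based sketch. This is not the module's implementation: the
       balance_tags() helper, the token list and the exact comment syntax are
       assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: balances a flat list of tokens for a given tag set.
# Unclosed opening tags get a closing tag; stray closing tags are commented out.
sub balance_tags {
    my ($Tokens, $TagsToFix) = @_;
    my %Fix = map { $_ => 1 } @$TagsToFix;
    my (@Open, @Out);

    for my $Token (@$Tokens) {
        if ($Token =~ m{^<(/?)(\w+)>$} && $Fix{ lc $2 }) {
            my ($IsEnd, $Tag) = ($1, lc $2);
            if (!$IsEnd) {
                push @Open, $Tag;                  # remember the opener
            }
            elsif (grep { $_ eq $Tag } @Open) {
                # Close anything opened after the matching opener
                while ((my $Inner = pop @Open) ne $Tag) {
                    push @Out, "</$Inner>";
                }
            }
            else {
                push @Out, "<!--${Token}-->";      # unbalanced closing tag
                next;
            }
        }
        push @Out, $Token;
    }
    push @Out, "</$_>" for reverse @Open;          # close whatever is left open
    return join '', @Out;
}

print balance_tags([ '<b>', 'x', '<i>', 'y', '</b>', '</u>' ], [qw(b i u)]), "\n";
```
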

CALLBACK METHODS

       COMMON PARAMETERS
           A number of the callbacks share the same parameters. These common
           parameters are documented here. Certain variables may have specific
           meanings in certain callbacks, so be sure to check the
           documentation for that method first before referring to this
           section.

           $context
               You can pass an arbitrary scalar as a 'context' value that's
               then passed as the first parameter to all callback functions.
               Most commonly this is something like '$Self'.

           $Defang
               Current HTML::Defang instance.

           $OpenAngle
               Opening angle bracket (<) of the current tag.

           $lcTag
               Lower case version of the HTML tag that is currently being
               parsed.

           $IsEndTag
               Has the value '/' if the current tag is a closing tag.

           $AttributeHash
               A reference to a hash containing the attributes of the current
               tag and their values. Each value is a scalar reference to the
               value, rather than just a scalar value. You can add attributes
               (remember to make it a scalar ref, e.g.
               $AttributeHash->{"newattr"} = \"newval"), delete attributes, or
               modify attribute values in this hash, and any changes you make
               will be incorporated into the output HTML stream.

               The attribute values will have any entity references decoded
               before being passed to you, and any unsafe values will be
               re-encoded back into the HTML stream.

               So for instance, the tag:

                 <div title="&lt;&quot;Hi there &#x003C;">

               Will have the attribute hash:

                 { title => \q[<"Hi there <] }

               And will be turned back into the HTML on output:

                 <div title="&lt;&quot;Hi there &lt;">

           $CloseAngle
               Anything after the end of the last attribute, including the
               closing HTML angle bracket (>).

           $HtmlR
               A scalar reference to the input HTML. The input HTML is parsed
               using m/\G$SomeRegex/gc constructs, so to continue from where
               HTML::Defang left off, clients can use m/\G$SomeRegex/gc for
               further processing on the input. One can also use the pos()
               function to determine where HTML::Defang left off. This,
               combined with the add_to_output() method, should give
               reasonable flexibility for the client to process the input.

           $OutR
               A scalar reference to the processed output HTML so far.

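           The $AttributeHash conventions above (values are scalar
           references; attributes can be modified, added or deleted) can be
           sketched as follows. The hash is built by hand here as a stand-in
           for the one a callback would receive, and tweak_attributes() and
           the attribute names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper operating on an attribute hash whose values are
# scalar references, in the shape described for $AttributeHash.
sub tweak_attributes {
    my ($AttributeHash) = @_;

    # Modify an existing value through its scalar reference
    ${ $AttributeHash->{title} } =~ s/</[/g if exists $AttributeHash->{title};

    # Add a new attribute; the value must be a scalar reference
    $AttributeHash->{rel} = \"nofollow";

    # Delete an attribute outright
    delete $AttributeHash->{onclick};

    return $AttributeHash;
}

# Hand-built stand-in (values already entity-decoded, as documented)
my %attrs = (
    title   => \( my $title   = q[<"Hi there <] ),
    onclick => \( my $onclick = 'alert(1)' ),
);
tweak_attributes(\%attrs);

print "$_ => ${ $attrs{$_} }\n" for sort keys %attrs;
```
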
       tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
       $AttributeHash, $CloseAngle, $HtmlR, $OutR)
           If $Defang->{tags_callback} exists, and HTML::Defang has parsed a
           tag present in $Defang->{tags_to_callback}, the above callback is
           made to the client code. The return value of this method determines
           whether the tag is defanged or not. More details below.

           Return values
               DEFANG_NONE
                   The current tag will not be defanged.

               DEFANG_ALWAYS
                   The current tag will be defanged.

               DEFANG_DEFAULT
                   The current tag will be processed normally by HTML::Defang
                   as if there was no callback method specified.

       attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
       $HtmlR, $OutR)
           If $Defang->{attribs_callback} exists, and HTML::Defang has parsed
           an attribute present in $Defang->{attribs_to_callback}, the above
           callback is made to the client code. The return value of this
           method determines whether the attribute is defanged or not. More
           details below.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently
                   being parsed.

               $AttrVal
                   Reference to the HTML attribute value that is currently
                   being parsed.

                   See $AttributeHash for details of decoding.

           Return values
               DEFANG_NONE
                   The current attribute will not be defanged.

               DEFANG_ALWAYS
                   The current attribute will be defanged.

               DEFANG_DEFAULT
                   The current attribute will be processed normally by
                   HTML::Defang as if there was no callback method specified.

       url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
       $AttributeHash, $HtmlR, $OutR)
           If $Defang->{url_callback} exists, and HTML::Defang has parsed a
           URL, the above callback is made to the client code. The return
           value of this method determines whether the attribute containing
           the URL is defanged or not. URL callbacks can be made from <style>
           tags as well as style attributes, in which case the particular
           style declaration will be commented out. More details below.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently
                   being parsed. However, if this callback is made as a result
                   of parsing a URL in a style attribute, $lcAttrKey will be
                   set to the string 'style', and it will be set to undef if
                   this callback is made as a result of parsing a URL inside a
                   style tag.

               $AttrVal
                   Reference to the URL value that is currently being parsed.

               $AttributeHash
                   A reference to a hash containing the attributes of the
                   current tag and their values. Each value is a scalar
                   reference to the value, rather than just a scalar value.
                   You can add attributes (remember to make it a scalar ref,
                   e.g. $AttributeHash->{"newattr"} = \"newval"), delete
                   attributes, or modify attribute values in this hash, and
                   any changes you make will be incorporated into the output
                   HTML stream. Will be set to undef if the callback is made
                   due to a URL in a <style> tag or attribute.

           Return values
               DEFANG_NONE
                   The current URL will not be defanged.

               DEFANG_ALWAYS
                   The current URL will be defanged.

               DEFANG_DEFAULT
                   The current URL will be processed normally by HTML::Defang
                   as if there was no callback method specified.

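       A URL callback that matches a site name anywhere in the URL will also
       accept URLs that merely mention that name in a path or query string.
       The sketch below compares the host exactly instead. It runs without
       the module, so the three constants are local stand-ins with
       placeholder values, and the allowed host names are hypothetical. It is
       not a full URL parser.

```perl
use strict;
use warnings;

# Stand-ins so the sketch runs on its own; the numeric values are placeholders
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical allowlist of exact host names
my %AllowedHost = map { $_ => 1 } qw(safesite.com www.safesite.com);

sub DefangUrlCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR, $OutR) = @_;

    # Only absolute http(s) URLs are judged here; the rest get default handling
    my ($Authority) = $$AttrValR =~ m{^\s*https?://([^/?#\s]+)}i
        or return DEFANG_DEFAULT;

    # "user@host" forms are a common disguise, so refuse them outright
    return DEFANG_ALWAYS if $Authority =~ /@/;

    # Drop an optional port, then require an exact host match
    ( my $Host = lc $Authority ) =~ s/:\d*\z//;
    return $AllowedHost{$Host} ? DEFANG_NONE : DEFANG_ALWAYS;
}

for my $Url ('http://safesite.com/x', 'http://evil.com/?safesite.com', '/relative') {
    print "$Url => ", DefangUrlCallback(undef, undef, 'a', 'href', \$Url), "\n";
}
```
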
       css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag,
       $IsAttr, $OutR)
           If $Defang->{css_callback} exists, and HTML::Defang has parsed a
           <style> tag or style attribute, the above callback is made to the
           client code. The return value of this method determines whether a
           particular declaration in the style rules is defanged or not. More
           details below.

           Method parameters
               $Selectors
                   Reference to an array containing the selectors in a style
                   tag or attribute.

               $SelectorRules
                   Reference to an array containing the style declaration
                   blocks of all selectors in a style tag or attribute.
                   Consider the below CSS:

                     a { b:c; d:e}
                     j { k:l; m:n}

                   The declaration blocks will get parsed into the following
                   data structure:

                     [
                       [
                         [ "b", "c", DEFANG_DEFAULT ],
                         [ "d", "e", DEFANG_DEFAULT ]
                       ],
                       [
                         [ "k", "l", DEFANG_DEFAULT ],
                         [ "m", "n", DEFANG_DEFAULT ]
                       ]
                     ]

                   So, generally each property:value pair in a declaration is
                   parsed into an array of the form

                     ["property", "value", X]

                   where X can be DEFANG_NONE, DEFANG_ALWAYS or
                   DEFANG_DEFAULT, with DEFANG_DEFAULT as the default value. A
                   client can change this value to tell HTML::Defang how to
                   treat this property:value pair.

                   DEFANG_NONE - Do not defang

                   DEFANG_ALWAYS - Defang the property:value pair

                   DEFANG_DEFAULT - Process this as if there is no callback
                   specified

               $IsAttr
                   True if the currently processed item is a style attribute.
                   False if the currently processed item is a style tag.
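
       Walking the $SelectorRules structure can be sketched as below. The
       sketch follows the layout shown in the example above (one array of
       [property, value, flag] triples per selector); the SYNOPSIS loops one
       level deeper, so it is worth checking the structure your version
       actually passes. The constants are local stand-ins with placeholder
       values, and flag_declarations() is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-ins so the sketch runs on its own; the numeric values are placeholders
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hand-built data in the documented shape
my @Selectors     = ('a', 'j');
my @SelectorRules = (
    [ [ 'b', 'c',            DEFANG_DEFAULT ], [ 'position', 'fixed', DEFANG_DEFAULT ] ],
    [ [ 'k', 'l !important', DEFANG_DEFAULT ], [ 'm',        'n',     DEFANG_DEFAULT ] ],
);

# Hypothetical helper: flags declarations by setting the third element
sub flag_declarations {
    my ($Selectors, $SelectorRules) = @_;
    for my $i (0 .. $#$Selectors) {
        for my $Declaration (@{ $SelectorRules->[$i] }) {
            my ($Property, $Value) = @$Declaration;
            $Declaration->[2] = DEFANG_ALWAYS
                if $Value =~ /!important/i
                or ( lc($Property) eq 'position' and $Value =~ /\bfixed\b/i );
        }
    }
    return;
}

flag_declarations(\@Selectors, \@SelectorRules);

for my $i (0 .. $#Selectors) {
    print "$Selectors[$i]: $_->[0]:$_->[1] => $_->[2]\n" for @{ $SelectorRules[$i] };
}
```
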

METHODS

       PUBLIC METHODS
           defang($InputHtml)
               Cleans up $InputHtml of any executable code including
               scripting, embedded objects, applets, etc., and defangs any
               XSS attacks.

               Method parameters
                   $InputHtml
                       The input HTML string that needs to be sanitized.

               Returns the cleaned HTML. If fix_mismatched_tags is set, any
               tags that appear in @$mismatched_tags_to_fix that are
               unbalanced are automatically commented out or closed.

           add_to_output($String)
               Appends $String to the output after the current parsed tag
               ends. Can be used by client code in callback methods to add
               HTML text to the processed output. If the HTML text needs to be
               defanged, client code can safely call HTML::Defang->defang()
               recursively from within the callback.

               Method parameters
                   $String
                       The string that is added after the current parsed tag
                       ends.

       INTERNAL METHODS
           Generally these methods never need to be called by users of the
           class, because they'll be called internally as the appropriate tags
           are encountered, but they may be useful for some users in some
           cases.

           defang_script($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
           $TagTrail, $Attributes, $CloseAngle)
               This method is invoked when a <script> tag is parsed. Defangs
               the <script> opening tag, and any closing tag. Any scripting
               content is also commented out, so browsers don't display it.

               Returns 1 to indicate that the <script> tag must be defanged.

               Method parameters
                   $OutR
                       A reference to the processed output HTML before the tag
                       that is currently being parsed.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $TagOps
                       Indicates what operation should be done on a tag. Can
                       be undefined, an integer or a code reference. Undefined
                       indicates a tag unknown to HTML::Defang, 1 indicates a
                       known safe tag, 0 indicates a known unsafe tag, and a
                       code reference indicates a subroutine that should be
                       called to parse the current tag. For example, <style>
                       and <script> tags are parsed by dedicated subroutines.

                   $OpenAngle
                       Opening angle bracket (<) of the current tag.

                   $IsEndTag
                       Has the value '/' if the current tag is a closing tag.

                   $Tag
                       The HTML tag that is currently being parsed.

                   $TagTrail
                       Any space after the tag, but before attributes.

                   $Attributes
                       A reference to an array of the attributes and their
                       values, including any surrounding spaces. Each element
                       of the array is added by 'push' calls like the one
                       below.

                         push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];

                   $CloseAngle
                       Anything after the end of the last attribute, including
                       the closing HTML angle bracket (>).

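               The $TagOps and $Attributes parameters above can be pictured
               with hand-built data. In this sketch the %TagOps table, its
               entries and should_defang() are hypothetical; only the meaning
               of the values (undef, 1, 0, code reference) and the
               seven-element attribute layout come from the descriptions
               above.

```perl
use strict;
use warnings;

# Hypothetical tag table: 1 = known safe, 0 = known unsafe,
# code ref = dedicated handler, no entry (undef) = unknown tag.
my %TagOps = (
    b      => 1,
    embed  => 0,
    script => sub { 1 },    # like defang_script(): tag must be defanged
    style  => sub { 0 },    # like defang_style(): tag must not be defanged
);

# Returns 1 when the tag should be defanged, 0 when it may pass through
sub should_defang {
    my ($lcTag) = @_;
    my $Ops = $TagOps{$lcTag};
    return 1 unless defined $Ops;                    # unknown: not whitelisted
    return $Ops->() ? 1 : 0 if ref $Ops eq 'CODE';   # ask the handler
    return $Ops ? 0 : 1;                             # safe passes, unsafe does not
}

# Hand-built stand-in for @$Attributes: name, space before '=', '=' plus
# following space, quote, value, quote, trailing space
my $Attributes = [
    [ 'href',  '',  '=',  '"', '/index.html', '"', ' ' ],
    [ 'title', ' ', '= ', "'", 'Home',        "'", ''  ],
];

# Joining every field reproduces the original attribute text
my $Rejoined = join '', map { join '', @$_ } @$Attributes;
my @Names    = map { $_->[0] } @$Attributes;

print "$_ => ", should_defang($_), "\n" for qw(b embed script style blink);
print "$Rejoined\n";
```
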
           defang_style($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag, $Tag,
           $TagTrail, $Attributes, $CloseAngle, $IsAttr)
               Builds a list of selectors and declarations from HTML style
               tags as well as style attributes in HTML tags and calls
               defang_stylerule() to do the actual defanging.

               Returns 0 to indicate that style tags must not be defanged.

               Method parameters
                   $IsAttr
                       Whether we are currently parsing a style attribute or
                       style tag. $IsAttr will be true if we are currently
                       parsing a style attribute.

                   For a description of the other parameters, see the
                   documentation of the defang_script() method.

           cleanup_style($StyleString)
               Helper function to clean up CSS data. This function operates
               directly on the input string without taking a copy.

               Method parameters
                   $StyleString
                       The input style string that is cleaned.

           defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr,
           $HtmlR, $OutR)
               Defangs style data.

               Method parameters
                   $SelectorsIn
                       An array reference to the selectors in the style
                       tag/attribute contents.

                   $StyleRules
                       An array reference to the declaration blocks in the
                       style tag/attribute contents.

                   $lcTag
                       Lower case version of the HTML tag that is currently
                       being parsed.

                   $IsAttr
                       Whether we are currently parsing a style attribute or
                       style tag. $IsAttr will be true if we are currently
                       parsing a style attribute.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $OutR
                       A scalar reference to the processed output so far.

           defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
           $Tag, $TagTrail, $Attributes, $CloseAngle)
               Defangs attributes and tags, and makes the tag, attribute, CSS
               and URL callbacks.

               Method parameters
                   For a description of the method parameters, see the
                   documentation of the defang_script() method.

           cleanup_attribute($AttributeString)
               Helper function to clean up attributes.

               Method parameters
                   $AttributeString
                       The value of the attribute.

SEE ALSO

       <http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>,
       HTML::StripScripts, HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

AUTHOR

       Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
       <cpan@robm.fastmail.fm> for initial code, guidance and support and bug
       fixes.

       Copyright (C) 2003-2010 by Opera Software Australia Pty Ltd

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.12.2                      2011-01-03                   HTML::Defang(3)