HTML::Defang(3)       User Contributed Perl Documentation      HTML::Defang(3)

NAME

       HTML::Defang - Cleans HTML as well as CSS of scripting and other
       executable contents, and neutralises XSS attacks.

SYNOPSIS

         use HTML::Defang;

         my $InputHtml = "<html><body></body></html>";

         my $Defang = HTML::Defang->new(
           context => $Self,
           fix_mismatched_tags => 1,
           tags_to_callback => [ qw(br embed img) ],
           tags_callback => \&DefangTagsCallback,
           url_callback => \&DefangUrlCallback,
           css_callback => \&DefangCssCallback,
           attribs_to_callback => [ qw(border src) ],
           attribs_callback => \&DefangAttribsCallback,
           content_callback => \&DefangContentCallback,
         );

         my $SanitizedHtml = $Defang->defang($InputHtml);

         # Callback for custom handling of specific HTML tags
         sub DefangTagsCallback {
           my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

           # Explicitly defang this tag, even though safe
           return DEFANG_ALWAYS if $lcTag eq 'br';

           # Explicitly whitelist this tag, even though unsafe
           return DEFANG_NONE if $lcTag eq 'embed';

           # I am not sure what to do with this tag, so process as HTML::Defang normally would
           return DEFANG_DEFAULT if $lcTag eq 'img';
         }

         # Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
         sub DefangUrlCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

           # Explicitly allow this URL in tag attributes or stylesheets
           return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;

           # Explicitly defang this URL in tag attributes or stylesheets
           return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;
         }

         # Callback for custom handling of style tags/attributes
         sub DefangCssCallback {
           my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
           my $i = 0;
           foreach (@$Selectors) {
             my $SelectorRule = $$SelectorRules[$i];
             foreach my $KeyValueRules (@$SelectorRule) {
               foreach my $KeyValueRule (@$KeyValueRules) {
                 my ($Key, $Value) = @$KeyValueRule;

                 # Comment out any '!important' directive
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ /!important/;

                 # Comment out any 'position: fixed' declaration
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ /position/ && $Value =~ /fixed/;
               }
             }
             $i++;
           }
         }

         # Callback for custom handling of HTML tag attributes
         sub DefangAttribsCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

           # Change all 'border' attribute values to zero.
           $$AttrValR = '0' if $lcAttrKey eq 'border';

           # Defang all 'src' attributes
           return DEFANG_ALWAYS if $lcAttrKey eq 'src';

           return DEFANG_NONE;
         }

         # Callback for all content between tags (except <style>, <script>, etc)
         sub DefangContentCallback {
           my ($Self, $Defang, $ContentR) = @_;

           $$ContentR =~ s/remove this content//;
         }

DESCRIPTION

       This module accepts an input HTML and/or CSS string and removes any
       executable code including scripting, embedded objects, applets, etc.,
       and neutralises any XSS attacks. A whitelist based approach is used,
       which means only HTML known to be safe is allowed through.

       HTML::Defang uses a custom HTML tag parser. The parser has been
       designed and tested to work with nasty real world HTML and to
       emulate as closely as possible what browsers actually do with strange
       looking constructs. The test suite has been built based on examples
       from a range of sources such as http://ha.ckers.org/xss.html and
       http://imfo.ru/csstest/css_hacks/import.php to ensure that as many
       XSS attack scenarios as possible have been dealt with.

       HTML::Defang can make callbacks to client code when it encounters the
       following:

       ·   When a specified tag is parsed

       ·   When a specified attribute is parsed

       ·   When a URL is parsed as part of an HTML attribute, or CSS property
           value.

       ·   When style data is parsed, as part of an HTML style attribute, or
           as part of an HTML <style> tag.

       The callbacks include details about the current tag/attribute that is
       being parsed, and also give a scalar reference to the input HTML.
       Querying pos() on the input HTML should indicate where the module is
       with parsing. This gives the client code flexibility in working with
       HTML::Defang.

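       The position bookkeeping described above is ordinary Perl regex
       behaviour rather than anything specific to this module. A minimal
       sketch, assuming a parser that advances with \G-anchored matches under
       the /gc flags (the sample string and variable names are illustrative
       only):

```perl
use strict;
use warnings;

my $Html  = '<b>bold</b> trailing text';
my $HtmlR = \$Html;

# Consume one opening tag, the way a \G-anchored parser would.
$$HtmlR =~ m/\G<(\w+)>/gc or die "no tag at current position";
my $Tag      = $1;
my $AfterTag = pos($$HtmlR);    # offset just past "<b>"

# Resume from that offset; /c keeps pos() intact if a match fails.
my $Text = $$HtmlR =~ m/\G([^<]*)/gc ? $1 : '';

print "tag=$Tag offset=$AfterTag text=$Text\n";
```

       A callback that receives $HtmlR could inspect pos($$HtmlR) in the same
       way to find out how far parsing has progressed.
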
       HTML::Defang can defang whole tags, any attribute in a tag, any URL
       that appears as an attribute or style property, or any CSS declaration
       in a declaration block in a style rule. This helps to precisely block
       the most specific unwanted elements in the contents (for example,
       block just an offending attribute instead of the whole tag), while
       retaining any safe HTML/CSS.

CONSTRUCTOR

       HTML::Defang->new(%Options)
           Constructs a new HTML::Defang object. The following options are
           supported:

           Options
               tags_to_callback
                   Array reference of tags for which a callback should be
                   made. If a tag in this array is parsed, the subroutine
                   tags_callback() is invoked.

               attribs_to_callback
                   Array reference of tag attributes for which a callback
                   should be made. If an attribute in this array is parsed,
                   the subroutine attribs_callback() is invoked.

               tags_callback
                   Subroutine reference to be invoked when a tag listed in
                   @$tags_to_callback is parsed.

               attribs_callback
                   Subroutine reference to be invoked when an attribute listed
                   in @$attribs_to_callback is parsed.

               url_callback
                   Subroutine reference to be invoked when a URL is detected
                   in an HTML tag attribute or a CSS property.

               css_callback
                   Subroutine reference to be invoked when CSS data is found
                   either as the contents of a 'style' attribute in an HTML
                   tag, or as the contents of a <style> HTML tag.

               content_callback
                   Subroutine reference to be invoked when standard content
                   between HTML tags is found.

               fix_mismatched_tags
                   This property, if set, fixes mismatched tags in the HTML
                   input. By default, tags present in the default
                   %mismatched_tags_to_fix hash are fixed. This set of tags
                   can be overridden by passing in an array reference
                   $mismatched_tags_to_fix to the constructor. Any opened tags
                   in the set are automatically closed if no corresponding
                   closing tag is found. If an unbalanced closing tag is
                   found, it is commented out.

               mismatched_tags_to_fix
                   Array reference of tags for which the code would check for
                   matching opening and closing tags. See the property
                   $fix_mismatched_tags.

               context
                   You can pass an arbitrary scalar as a 'context' value
                   that's then passed as the first parameter to all callback
                   functions. Most commonly this is something like '$Self'.

               allow_double_defang
                   If this is true, then tag names and attribute names which
                   already begin with the defang string ("defang_" by default)
                   will have an additional copy of the defang string prepended
                   if they are flagged to be defanged by the return value of a
                   callback, or if the tag or attribute name is unknown.

                   The default is to assume that tag names and attribute names
                   beginning with the defang string are already made safe, and
                   need no further modification, even if they are flagged to
                   be defanged by the return value of a callback.  Any tag or
                   attribute modifications made directly by a callback are
                   still performed.

               delete_defang_content
                   Normally defanged tags are turned into comments and
                   prefixed by defang_, and defanged styles are surrounded by
                   /* ... */. If this is set to true, then defanged content is
                   deleted instead.

               Debug
                   If set, prints debugging output.

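           A minimal sketch of how the delete_defang_content option changes
           the result, assuming the module is installed and that new() can be
           called with no options at all. The input string is made up, and
           the exact output text is deliberately not shown because it depends
           on the module version:

```perl
use strict;
use warnings;
use HTML::Defang;

my $Input = '<p>Hello</p><script>alert(1)</script>';

# Default behaviour: defanged content is kept, but commented out.
my $Commented = HTML::Defang->new()->defang($Input);

# With delete_defang_content set, defanged content is dropped instead.
my $Deleted = HTML::Defang->new(
  fix_mismatched_tags   => 1,
  delete_defang_content => 1,
)->defang($Input);

print "commented: $Commented\n";
print "deleted:   $Deleted\n";
```

           Comparing the two printed strings is a quick way to decide which
           mode suits an application.
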
       HTML::Defang->new_bodyonly(%Options)
           Constructs a new HTML::Defang object that has the following
           implicit options:

           fix_mismatched_tags = 1
           delete_defang_content = 1
           tags_to_callback = [ qw(html head link body meta title bgsound) ]
           tags_callback = { ... remove all above tags and related content ...
           }
           url_callback = { ... explicitly DEFANG_NONE to leave everything
           alone ... }

           Basically this is an easy way to remove all HTML boilerplate
           content and return only the HTML body content.

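           A short usage sketch, assuming new_bodyonly() needs no further
           options. The sample page is invented for illustration, and the
           exact whitespace of the result is not guaranteed here:

```perl
use strict;
use warnings;
use HTML::Defang;

my $Page = '<html><head><title>T</title></head>'
         . '<body><p>Hello</p></body></html>';

# Strip the document boilerplate and keep the body content.
my $Body = HTML::Defang->new_bodyonly()->defang($Page);

print "$Body\n";
```
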

CALLBACK METHODS

       COMMON PARAMETERS
           A number of the callbacks share the same parameters. These common
           parameters are documented here. Certain variables may have specific
           meanings in certain callbacks, so be sure to check the
           documentation for that method first before referring to this
           section.

           $context
               You can pass an arbitrary scalar as a 'context' value that's
               then passed as the first parameter to all callback functions.
               Most commonly this is something like '$Self'.

           $Defang
               Current HTML::Defang instance.

           $OpenAngle
               Opening angle (<) sign of the current tag.

           $lcTag
               Lower case version of the HTML tag that is currently being
               parsed.

           $IsEndTag
               Has the value '/' if the current tag is a closing tag.

           $AttributeHash
               A reference to a hash containing the attributes of the current
               tag and their values. Each value is a scalar reference to the
               value, rather than just a scalar value. You can add attributes
               (remember to make it a scalar ref, eg
               $AttributeHash->{"newattr"} = \"newval"), delete attributes, or
               modify attribute values in this hash, and any changes you make
               will be incorporated into the output HTML stream.

               The attribute values will have any entity references decoded
               before being passed to you, and any unsafe values will be
               re-encoded back into the HTML stream.

               So for instance, the tag:

                 <div title="&lt;&quot;Hi there &#x003C;">

               Will have the attribute hash:

                 { title => \q[<"Hi there <] }

               And will be turned back into the HTML on output:

                 <div title="&lt;&quot;Hi there &lt;">

           $CloseAngle
               Anything after the end of the last attribute, including the
               closing HTML angle (>).

           $HtmlR
               A scalar reference to the input HTML. The input HTML is parsed
               using m/\G$SomeRegex/c constructs, so to continue from where
               HTML::Defang left off, clients can use m/\G$SomeRegex/c for
               further processing on the input. One can also use the pos()
               function to determine where HTML::Defang left off. This
               combined with the add_to_output() method should give
               reasonable flexibility for the client to process the input.

           $OutR
               A scalar reference to the processed output HTML so far.

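           As an illustration of editing $AttributeHash, the sketch below
           adds a title attribute to opening <a> tags that lack one. It
           assumes the DEFANG_* constants are exported by "use HTML::Defang",
           as the SYNOPSIS implies. The callback name and the attribute text
           are hypothetical:

```perl
use strict;
use warnings;
use HTML::Defang;

sub AddTitleCallback {
  my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash,
      $CloseAngle, $HtmlR, $OutR) = @_;

  # Leave closing tags alone; only opening tags carry attributes.
  return DEFANG_DEFAULT if $IsEndTag;

  # Values in the hash are scalar references, so store a reference to a
  # fresh, writable scalar rather than the plain string.
  unless (exists $AttributeHash->{title}) {
    my $Title = 'external link';
    $AttributeHash->{title} = \$Title;
  }

  return DEFANG_DEFAULT;
}

my $Defang = HTML::Defang->new(
  tags_to_callback => [ qw(a) ],
  tags_callback    => \&AddTitleCallback,
);

print $Defang->defang('<a href="http://example.com/">example</a>'), "\n";
```
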
       tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
       $AttributeHash, $CloseAngle, $HtmlR, $OutR)
           If $Defang->{tags_callback} exists, and HTML::Defang has parsed a
           tag present in $Defang->{tags_to_callback}, the above callback is
           made to the client code. The return value of this method determines
           whether the tag is defanged or not. More details below.

           Return values
               DEFANG_NONE
                   The current tag will not be defanged.

               DEFANG_ALWAYS
                   The current tag will be defanged.

               DEFANG_DEFAULT
                   The current tag will be processed normally by HTML::Defang
                   as if there was no callback method specified.

       attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
       $HtmlR, $OutR)
           If $Defang->{attribs_callback} exists, and HTML::Defang has parsed
           an attribute present in $Defang->{attribs_to_callback}, the above
           callback is made to the client code. The return value of this
           method determines whether the attribute is defanged or not. More
           details below.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently
                   being parsed.

               $AttrVal
                   Reference to the HTML attribute value that is currently
                   being parsed.

                   See $AttributeHash for details of decoding.

           Return values
               DEFANG_NONE
                   The current attribute will not be defanged.

               DEFANG_ALWAYS
                   The current attribute will be defanged.

               DEFANG_DEFAULT
                   The current attribute will be processed normally by
                   HTML::Defang as if there was no callback method specified.

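           Because $AttrVal is a reference, a callback can rewrite the value
           in place and still leave the defang decision to the module. A
           minimal sketch, assuming the callback is registered for the alt
           and title attributes; the callback name and the length limit are
           hypothetical:

```perl
use strict;
use warnings;
use HTML::Defang;

# Hypothetical length limit, chosen only for illustration.
my $MaxLen = 80;

sub TrimAttribsCallback {
  my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR, $OutR) = @_;

  # Nothing to tidy for attributes that carry no value.
  return DEFANG_DEFAULT unless defined $$AttrValR;

  # Edit the decoded value in place through the scalar reference.
  $$AttrValR =~ s/\s+/ /g;
  $$AttrValR = substr($$AttrValR, 0, $MaxLen)
    if length($$AttrValR) > $MaxLen;

  # Let HTML::Defang apply its usual rules to the tidied value.
  return DEFANG_DEFAULT;
}

my $Defang = HTML::Defang->new(
  attribs_to_callback => [ qw(alt title) ],
  attribs_callback    => \&TrimAttribsCallback,
);
```
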
       url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
       $AttributeHash, $HtmlR, $OutR)
           If $Defang->{url_callback} exists, and HTML::Defang has parsed a
           URL, the above callback is made to the client code. The return
           value of this method determines whether the attribute containing
           the URL is defanged or not. URL callbacks can be made from <style>
           tags as well as style attributes, in which case the particular
           style declaration will be commented out. More details below.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently
                   being parsed. However if this callback is made as a result
                   of parsing a URL in a style attribute, $lcAttrKey will be
                   set to the string 'style', or will be set to undef if this
                   callback is made as a result of parsing a URL inside a
                   style tag.

               $AttrVal
                   Reference to the URL value that is currently being parsed.

               $AttributeHash
                   A reference to a hash containing the attributes of the
                   current tag and their values. Each value is a scalar
                   reference to the value, rather than just a scalar value.
                   You can add attributes (remember to make it a scalar ref,
                   eg $AttributeHash->{"newattr"} = \"newval"), delete
                   attributes, or modify attribute values in this hash, and
                   any changes you make will be incorporated into the output
                   HTML stream. Will be set to undef if the callback is made
                   due to a URL in a <style> tag or attribute.

           Return values
               DEFANG_NONE
                   The current URL will not be defanged.

               DEFANG_ALWAYS
                   The current URL will be defanged.

               DEFANG_DEFAULT
                   The current URL will be processed normally by HTML::Defang
                   as if there was no callback method specified.

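           One way to use these return values is a host allowlist. The sketch
           below is an assumption-laden example rather than a recommended
           policy: the host names and the callback name are hypothetical, and
           the host is extracted with a simple regex instead of a full URL
           parser:

```perl
use strict;
use warnings;
use HTML::Defang;

# Hypothetical allowlist of hosts.
my %AllowedHost = map { $_ => 1 } qw(example.com www.example.com);

sub HostUrlCallback {
  my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash,
      $HtmlR, $OutR) = @_;

  # Absolute http(s) URLs: allow listed hosts, defang everything else.
  if ($$AttrValR =~ m{^https?://([^/:?#]+)}i) {
    return $AllowedHost{lc $1} ? DEFANG_NONE : DEFANG_ALWAYS;
  }

  # Anything else (relative URLs, other schemes) gets default handling.
  return DEFANG_DEFAULT;
}

my $Defang = HTML::Defang->new(url_callback => \&HostUrlCallback);
```

           The callback never reads $lcAttrKey or $AttributeHash, so it
           behaves the same whether the URL came from an attribute or from
           style data, where those parameters may be undef.
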
       css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag,
       $IsAttr, $OutR)
           If $Defang->{css_callback} exists, and HTML::Defang has parsed a
           <style> tag or style attribute, the above callback is made to the
           client code. The return value of this method determines whether a
           particular declaration in the style rules is defanged or not. More
           details below.

           Method parameters
               $Selectors
                   Reference to an array containing the selectors in a style
                   tag or attribute.

               $SelectorRules
                   Reference to an array containing the style declaration
                   blocks of all selectors in a style tag or attribute.
                   Consider the below CSS:

                     a { b:c; d:e}
                     j { k:l; m:n}

                   The declaration blocks will get parsed into the following
                   data structure:

                     [
                       [
                         [ "b", "c", DEFANG_DEFAULT ],
                         [ "d", "e", DEFANG_DEFAULT ]
                       ],
                       [
                         [ "k", "l", DEFANG_DEFAULT ],
                         [ "m", "n", DEFANG_DEFAULT ]
                       ]
                     ]

                   So, generally each property:value pair in a declaration is
                   parsed into an array of the form

                     ["property", "value", X]

                   where X can be DEFANG_NONE, DEFANG_ALWAYS or
                   DEFANG_DEFAULT, with DEFANG_DEFAULT as the default value.
                   A client can manipulate this value to instruct
                   HTML::Defang to defang this property:value pair.

                   DEFANG_NONE - Do not defang

                   DEFANG_ALWAYS - Defang the property:value pair

                   DEFANG_DEFAULT - Process this as if there is no callback
                   specified

               $IsAttr
                   True if the currently processed item is a style attribute.
                   False if the currently processed item is a style tag.

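           The SYNOPSIS loops one level deeper than the data structure shown
           above, so the sketch below avoids assuming a fixed depth: it
           recurses until it reaches an array whose first element is a plain
           scalar and treats that array as a [ property, value, flag ]
           triple. WalkDeclarations and FixedPositionCssCallback are
           hypothetical helper names, not part of the module:

```perl
use strict;
use warnings;
use HTML::Defang;

# Call $Visit on every [ property, value, flag ] triple under $Node,
# whatever the nesting depth.
sub WalkDeclarations {
  my ($Node, $Visit) = @_;
  return unless ref $Node eq 'ARRAY' && @$Node;

  # A triple starts with a plain scalar (the property name).
  if (!ref $Node->[0]) {
    $Visit->($Node);
    return;
  }
  WalkDeclarations($_, $Visit) for @$Node;
}

sub FixedPositionCssCallback {
  my ($Context, $Defang, $Selectors, $SelectorRules, $lcTag, $IsAttr) = @_;

  WalkDeclarations($SelectorRules, sub {
    my ($Decl) = @_;
    my ($Property, $Value) = @$Decl;
    return unless defined $Property && defined $Value;

    # Flag 'position: fixed' declarations for defanging.
    $Decl->[2] = DEFANG_ALWAYS
      if lc($Property) eq 'position' && $Value =~ /\bfixed\b/i;
  });
  return;
}
```
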

METHODS

       PUBLIC METHODS
           defang($InputHtml, \%Opts)
               Cleans up $InputHtml of any executable code including
               scripting, embedded objects, applets, etc., and defangs any
               XSS attacks.

               Method parameters
                   $InputHtml
                       The input HTML string that needs to be sanitized.

               Returns the cleaned HTML. If fix_mismatched_tags is set, any
               tags that appear in @$mismatched_tags_to_fix that are
               unbalanced are automatically commented or closed.

           add_to_output($String)
               Appends $String to the output after the current parsed tag
               ends. Can be used by client code in callback methods to add
               HTML text to the processed output. If the HTML text needs to be
               defanged, client code can safely call HTML::Defang->defang()
               recursively from within the callback.

               Method parameters
                   $String
                       The string that is added after the current parsed tag
                       ends.

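               A minimal sketch of add_to_output() called from a tag
               callback, assuming the appended string is emitted as given.
               The callback name, marker text and sample input are
               hypothetical:

```perl
use strict;
use warnings;
use HTML::Defang;

sub ImageNoteCallback {
  my ($Context, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash,
      $CloseAngle, $HtmlR, $OutR) = @_;

  # Queue a marker to be written once the current tag has been output.
  $Defang->add_to_output('[image]') unless $IsEndTag;

  return DEFANG_DEFAULT;
}

my $Defang = HTML::Defang->new(
  tags_to_callback => [ qw(img) ],
  tags_callback    => \&ImageNoteCallback,
);

my $Out = $Defang->defang(
  '<p>Logo: <img src="http://example.com/logo.png"></p>'
);
print "$Out\n";
```
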
       INTERNAL METHODS
           Generally these methods never need to be called by users of the
           class, because they'll be called internally as the appropriate tags
           are encountered, but they may be useful for some users in some
           cases.

           defang_script_tag($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
           $Tag, $TagTrail, $Attributes, $CloseAngle)
               This method is invoked when a <script> tag is parsed. Defangs
               the <script> opening tag, and any closing tag. Any scripting
               content is also commented out, so browsers don't display it.

               Returns 1 to indicate that the <script> tag must be defanged.

               Method parameters
                   $OutR
                       A reference to the processed output HTML before the tag
                       that is currently being parsed.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $TagOps
                       Indicates what operation should be done on a tag. Can
                       be undefined, an integer or a code reference. Undefined
                       indicates an unknown tag to HTML::Defang, 1 indicates a
                       known safe tag, 0 indicates a known unsafe tag, and a
                       code reference indicates a subroutine that should be
                       called to parse the current tag. For example, <style>
                       and <script> tags are parsed by dedicated subroutines.

                   $OpenAngle
                       Opening angle (<) sign of the current tag.

                   $IsEndTag
                       Has the value '/' if the current tag is a closing tag.

                   $Tag
                       The HTML tag that is currently being parsed.

                   $TagTrail
                       Any space after the tag, but before attributes.

                   $Attributes
                       A reference to an array of the attributes and their
                       values, including any surrounding spaces. Each element
                       of the array is added by 'push' calls like below.

                         push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];

                   $CloseAngle
                       Anything after the end of the last attribute, including
                       the closing HTML angle (>).

           defang_style_text($Content, $lcTag, $IsAttr, $AttributeHash,
           $HtmlR, $OutR)
               Defangs some raw CSS data and returns the defanged content.

               Method parameters
                   $Content
                       The input style string that is defanged.

                   $IsAttr
                       True if $Content is from an attribute, otherwise from a
                       <style> block.

           cleanup_style($StyleString)
               Helper function to clean up CSS data. This function directly
               operates on the input string without taking a copy.

               Method parameters
                   $StyleString
                       The input style string that is cleaned.

           defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr,
           $AttributeHash, $HtmlR, $OutR)
               Defangs style data.

               Method parameters
                   $SelectorsIn
                       An array reference to the selectors in the style
                       tag/attribute contents.

                   $StyleRules
                       An array reference to the declaration blocks in the
                       style tag/attribute contents.

                   $lcTag
                       Lower case version of the HTML tag that is currently
                       being parsed.

                   $IsAttr
                       Whether we are currently parsing a style attribute or
                       style tag. $IsAttr will be true if we are currently
                       parsing a style attribute.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $OutR
                       A scalar reference to the processed output so far.

           defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
           $Tag, $TagTrail, $Attributes, $CloseAngle)
               Defangs attributes, defangs tags, and makes the tag, attrib,
               css and url callbacks.

               Method parameters
                   For a description of the method parameters, see the
                   documentation of the defang_script_tag() method.

           cleanup_attribute($AttributeString)
               Helper function to clean up attributes.

               Method parameters
                   $AttributeString
                       The value of the attribute.

SEE ALSO

       <http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>,
       HTML::StripScripts, HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

AUTHOR

       Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
       <cpan@robm.fastmail.fm> for initial code, guidance, support and bug
       fixes.

COPYRIGHT AND LICENSE

       Copyright (C) 2003-2013 by FastMail Pty Ltd

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.32.0                      2020-07-28                   HTML::Defang(3)