HTML::Defang(3)       User Contributed Perl Documentation      HTML::Defang(3)

NAME
       HTML::Defang - Cleans HTML as well as CSS of scripting and other
       executable contents, and neutralises XSS attacks.

SYNOPSIS
         use HTML::Defang;

         my $InputHtml = "<html><body></body></html>";

         my $Defang = HTML::Defang->new(
           context => $Self,
           fix_mismatched_tags => 1,
           tags_to_callback => [ qw(br embed img) ],
           tags_callback => \&DefangTagsCallback,
           url_callback => \&DefangUrlCallback,
           css_callback => \&DefangCssCallback,
           attribs_to_callback => [ qw(border src) ],
           attribs_callback => \&DefangAttribsCallback,
           content_callback => \&DefangContentCallback,
         );

         my $SanitizedHtml = $Defang->defang($InputHtml);

         # Callback for custom handling of specific HTML tags
         sub DefangTagsCallback {
           my ($Self, $Defang, $OpenAngle, $lcTag, $IsEndTag, $AttributeHash, $CloseAngle, $HtmlR, $OutR) = @_;

           # Explicitly defang this tag, even though safe
           return DEFANG_ALWAYS if $lcTag eq 'br';

           # Explicitly whitelist this tag, even though unsafe
           return DEFANG_NONE if $lcTag eq 'embed';

           # I am not sure what to do with this tag, so process as HTML::Defang normally would
           return DEFANG_DEFAULT if $lcTag eq 'img';

           # Be explicit about the fall-through instead of returning a false value
           return DEFANG_DEFAULT;
         }

         # Callback for custom handling of URLs in HTML attributes as well as style tag/attribute declarations
         sub DefangUrlCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

           # Explicitly allow this URL in tag attributes or stylesheets
           return DEFANG_NONE if $$AttrValR =~ /safesite\.com/i;

           # Explicitly defang this URL in tag attributes or stylesheets
           return DEFANG_ALWAYS if $$AttrValR =~ /evilsite\.com/i;

           # Leave every other URL to the normal processing
           return DEFANG_DEFAULT;
         }

         # Callback for custom handling of style tags/attributes
         sub DefangCssCallback {
           my ($Self, $Defang, $Selectors, $SelectorRules, $Tag, $IsAttr) = @_;
           my $i = 0;
           foreach (@$Selectors) {
             my $SelectorRule = $$SelectorRules[$i];
             foreach my $KeyValueRules (@$SelectorRule) {
               foreach my $KeyValueRule (@$KeyValueRules) {
                 my ($Key, $Value) = @$KeyValueRule;

                 # Comment out any '!important' directive
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Value =~ /!important/;

                 # Comment out any 'position: fixed' declaration
                 $$KeyValueRule[2] = DEFANG_ALWAYS if $Key =~ /position/ && $Value =~ /fixed/;
               }
             }
             $i++;
           }
         }

         # Callback for custom handling of HTML tag attributes
         sub DefangAttribsCallback {
           my ($Self, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

           # Change all 'border' attribute values to zero.
           $$AttrValR = '0' if $lcAttrKey eq 'border';

           # Defang all 'src' attributes
           return DEFANG_ALWAYS if $lcAttrKey eq 'src';

           return DEFANG_NONE;
         }

         # Callback for all content between tags (except <style>, <script>, etc.)
         sub DefangContentCallback {
           my ($Self, $Defang, $ContentR) = @_;

           $$ContentR =~ s/remove this content//;
         }

DESCRIPTION
       This module accepts an input HTML and/or CSS string and removes any
       executable code including scripting, embedded objects, applets, etc.,
       and neutralises any XSS attacks. A whitelist-based approach is used,
       which means only HTML known to be safe is allowed through.

       HTML::Defang uses a custom HTML tag parser. The parser has been
       designed and tested to work with nasty real world HTML and to
       emulate as closely as possible what browsers actually do with strange
       looking constructs. The test suite has been built based on examples
       from a range of sources such as http://ha.ckers.org/xss.html and
       http://imfo.ru/csstest/css_hacks/import.php to ensure that as many
       XSS attack scenarios as possible have been dealt with.

       HTML::Defang can make callbacks to client code when it encounters the
       following:

       ·   When a specified tag is parsed.

       ·   When a specified attribute is parsed.

       ·   When a URL is parsed as part of an HTML attribute or CSS property
           value.

       ·   When style data is parsed, as part of an HTML style attribute or
           as part of an HTML <style> tag.

       The callbacks include details about the current tag/attribute that is
       being parsed, and also give a scalar reference to the input HTML.
       Querying pos() on the input HTML should indicate how far the module
       has got with parsing. This gives the client code flexibility in
       working with HTML::Defang.
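
       As a rough illustration of the pos() behaviour described above, here
       is a minimal sketch in plain Perl. It does not call HTML::Defang; the
       input string and both regexes are made up for the example, and the
       scalar reference merely stands in for the one a callback would
       receive.

```perl
use strict;
use warnings;

# Stand-in for the input HTML; a real callback would receive a reference to it.
my $Html  = '<b>bold</b><i>italic</i>';
my $HtmlR = \$Html;

# Consume the first tag with a \G-anchored match, as a parser would.
# The /g flag records the match position; /c keeps it if a match fails.
$$HtmlR =~ m/\G<(\w+)>/gc or die "no tag at start of input";
my $FirstTag = $1;

# pos() reports how far into the input the matching has progressed.
my $Offset = pos($$HtmlR);

# A later \G match resumes from that offset rather than from the beginning.
$$HtmlR =~ m/\G([^<]*)/gc;
my $Text = $1;

print "tag=$FirstTag offset=$Offset text=$Text\n";
```

       The same pattern lets client code look ahead in the input from inside
       a callback without re-parsing what has already been consumed.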

       HTML::Defang can defang whole tags, any attribute in a tag, any URL
       that appears as an attribute or style property, or any CSS declaration
       in a declaration block in a style rule. This helps to block precisely
       the unwanted elements in the contents (for example, just an offending
       attribute instead of the whole tag), while retaining any safe
       HTML/CSS.

CONSTRUCTOR
       HTML::Defang->new(%Options)
           Constructs a new HTML::Defang object. The following options are
           supported:

           Options
               tags_to_callback
                   Array reference of tags for which a callback should be
                   made. If a tag in this array is parsed, the subroutine
                   tags_callback() is invoked.

               attribs_to_callback
                   Array reference of tag attributes for which a callback
                   should be made. If an attribute in this array is parsed,
                   the subroutine attribs_callback() is invoked.

               tags_callback
                   Subroutine reference to be invoked when a tag listed in
                   @$tags_to_callback is parsed.

               attribs_callback
                   Subroutine reference to be invoked when an attribute listed
                   in @$attribs_to_callback is parsed.

               url_callback
                   Subroutine reference to be invoked when a URL is detected
                   in an HTML tag attribute or a CSS property.

               css_callback
                   Subroutine reference to be invoked when CSS data is found
                   either as the contents of a 'style' attribute in an HTML
                   tag, or as the contents of a <style> HTML tag.

               content_callback
                   Subroutine reference to be invoked when standard content
                   between HTML tags is found.

               fix_mismatched_tags
                   This property, if set, fixes mismatched tags in the HTML
                   input. By default, tags present in the default
                   %mismatched_tags_to_fix hash are fixed. This set of tags
                   can be overridden by passing in an array reference
                   $mismatched_tags_to_fix to the constructor. Any opened tags
                   in the set are automatically closed if no corresponding
                   closing tag is found. If an unbalanced closing tag is
                   found, it is commented out.

               mismatched_tags_to_fix
                   Array reference of tags for which the code checks for
                   matching opening and closing tags. See the property
                   $fix_mismatched_tags.

               context
                   You can pass an arbitrary scalar as a 'context' value
                   that's then passed as the first parameter to all callback
                   functions. Most commonly this is something like '$Self'.

               allow_double_defang
                   If this is true, then tag names and attribute names which
                   already begin with the defang string ("defang_" by default)
                   will have an additional copy of the defang string prepended
                   if they are flagged to be defanged by the return value of a
                   callback, or if the tag or attribute name is unknown.

                   The default is to assume that tag names and attribute names
                   beginning with the defang string are already made safe, and
                   need no further modification, even if they are flagged to
                   be defanged by the return value of a callback. Any tag or
                   attribute modifications made directly by a callback are
                   still performed.

               delete_defang_content
                   Normally defanged tags are turned into comments and
                   prefixed by defang_, and defanged styles are surrounded by
                   /* ... */. If this is set to true, then defanged content is
                   deleted instead.

               Debug
                   If set, prints debugging output.

       HTML::Defang->new_bodyonly(%Options)
           Constructs a new HTML::Defang object that has the following
           implicit options:

             fix_mismatched_tags = 1
             delete_defang_content = 1
             tags_to_callback = [ qw(html head link body meta title bgsound) ]
             tags_callback = { ... remove all above tags and related content ... }
             url_callback = { ... explicitly DEFANG_NONE to leave everything alone ... }

           Basically this is an easy way to remove all HTML boilerplate
           content and return only the HTML body content.

CALLBACK METHODS
       COMMON PARAMETERS
           A number of the callbacks share the same parameters. These common
           parameters are documented here. Certain variables may have specific
           meanings in certain callbacks, so be sure to check the
           documentation for that method first before referring to this
           section.

           $context
               You can pass an arbitrary scalar as a 'context' value that's
               then passed as the first parameter to all callback functions.
               Most commonly this is something like '$Self'.

           $Defang
               Current HTML::Defang instance.

           $OpenAngle
               Opening angle bracket (<) of the current tag.

           $lcTag
               Lower case version of the HTML tag that is currently being
               parsed.

           $IsEndTag
               Has the value '/' if the current tag is a closing tag.

           $AttributeHash
               A reference to a hash containing the attributes of the current
               tag and their values. Each value is a scalar reference to the
               value, rather than just a scalar value. You can add attributes
               (remember to make it a scalar ref, e.g.
               $AttributeHash->{"newattr"} = \"newval"), delete attributes, or
               modify attribute values in this hash, and any changes you make
               will be incorporated into the output HTML stream.

               The attribute values will have any entity references decoded
               before being passed to you, and any unsafe values will be
               re-encoded back into the HTML stream.

               So for instance, the tag:

                 <div title="&lt;&quot;Hi there &#x003C;">

               will have the attribute hash:

                 { title => \q[<"Hi there <] }

               and will be turned back into the following HTML on output:

                 <div title="&lt;&quot;Hi there &lt;">

           $CloseAngle
               Anything after the end of the last attribute, including the
               closing HTML angle bracket (>).

           $HtmlR
               A scalar reference to the input HTML. The input HTML is parsed
               using m/\G$SomeRegex/gc constructs, so clients can use
               m/\G$SomeRegex/gc themselves to continue processing the input
               from where HTML::Defang left off. One can also use the pos()
               function to determine where HTML::Defang left off. This,
               combined with the add_to_output() method, should give
               reasonable flexibility for the client to process the input.

           $OutR
               A scalar reference to the processed output HTML so far.
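
           The scalar-reference convention described for $AttributeHash can
           be sketched in plain Perl as follows. The hash below is built by
           hand for a hypothetical <img> tag; nothing here calls
           HTML::Defang, and the attribute names are only examples.

```perl
use strict;
use warnings;

# Hypothetical attribute hash for <img src="a.png" border="1">.
# Every value is a scalar reference, not a plain scalar.
my ($Src, $Border) = ('a.png', '1');
my $AttributeHash = { src => \$Src, border => \$Border };

# Modify an existing value in place through its reference.
${ $AttributeHash->{border} } = '0';

# Add a new attribute; the value must also be a scalar reference.
$AttributeHash->{alt} = \"placeholder";

# Delete an attribute outright.
delete $AttributeHash->{src};

# Dereference each remaining value to inspect the result.
for my $Key (sort keys %$AttributeHash) {
    print $Key, '=', ${ $AttributeHash->{$Key} }, "\n";
}
```

           Because the hash holds references, assigning through one changes
           the underlying value, while storing a plain string as a hash value
           would break the convention the callbacks expect.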

       tags_callback($context, $Defang, $OpenAngle, $lcTag, $IsEndTag,
       $AttributeHash, $CloseAngle, $HtmlR, $OutR)
           If $Defang->{tags_callback} exists, and HTML::Defang has parsed a
           tag present in $Defang->{tags_to_callback}, the above callback is
           made to the client code. The return value of this method determines
           whether the tag is defanged or not, as listed below.

           Return values
               DEFANG_NONE
                   The current tag will not be defanged.

               DEFANG_ALWAYS
                   The current tag will be defanged.

               DEFANG_DEFAULT
                   The current tag will be processed normally by HTML::Defang
                   as if there were no callback method specified.

       attribs_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
       $HtmlR, $OutR)
           If $Defang->{attribs_callback} exists, and HTML::Defang has parsed
           an attribute present in $Defang->{attribs_to_callback}, the above
           callback is made to the client code. The return value of this
           method determines whether the attribute is defanged or not, as
           listed below.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently
                   being parsed.

               $AttrVal
                   Reference to the HTML attribute value that is currently
                   being parsed.

                   See $AttributeHash for details of decoding.

           Return values
               DEFANG_NONE
                   The current attribute will not be defanged.

               DEFANG_ALWAYS
                   The current attribute will be defanged.

               DEFANG_DEFAULT
                   The current attribute will be processed normally by
                   HTML::Defang as if there were no callback method specified.
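
           A minimal sketch of an attribute callback that honours this
           contract is shown here. ExampleAttribsCallback is a hypothetical
           name, and the constants are local stand-ins whose numeric values
           are assumptions, defined only so the sketch runs without the
           module; real code would use the DEFANG_* constants the synopsis
           takes from HTML::Defang.

```perl
use strict;
use warnings;

# Stand-in constants (assumed values) so the sketch is self-contained.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical callback: rewrite 'border' values, defang 'src', and leave
# every other attribute to the normal processing.
sub ExampleAttribsCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $HtmlR) = @_;

    if ($lcAttrKey eq 'border') {
        $$AttrValR = '0';    # edit the value through the reference
        return DEFANG_NONE;
    }
    return DEFANG_ALWAYS if $lcAttrKey eq 'src';

    # Always end with an explicit return so no accidental false value leaks out.
    return DEFANG_DEFAULT;
}

# Direct call with dummy arguments, just to show the contract.
my $Value  = '5';
my $Result = ExampleAttribsCallback(undef, undef, 'img', 'border', \$Value, undef);
print "result=$Result value=$Value\n";
```

           Ending with an explicit return matters in Perl: a sub whose last
           statement is "return X if COND" hands back the false condition
           value when COND fails, which may not be the constant intended.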

       url_callback($context, $Defang, $lcTag, $lcAttrKey, $AttrVal,
       $AttributeHash, $HtmlR, $OutR)
           If $Defang->{url_callback} exists, and HTML::Defang has parsed a
           URL, the above callback is made to the client code. The return
           value of this method determines whether the attribute containing
           the URL is defanged or not. URL callbacks can be made from <style>
           tags as well as style attributes, in which case the particular
           style declaration will be commented out.

           Method parameters
               $lcAttrKey
                   Lower case version of the HTML attribute that is currently
                   being parsed. However, if this callback is made as a result
                   of parsing a URL in a style attribute, $lcAttrKey will be
                   set to the string 'style', and it will be set to undef if
                   the callback is made as a result of parsing a URL inside a
                   style tag.

               $AttrVal
                   Reference to the URL value that is currently being parsed.

               $AttributeHash
                   A reference to a hash containing the attributes of the
                   current tag and their values. Each value is a scalar
                   reference to the value, rather than just a scalar value.
                   You can add attributes (remember to make it a scalar ref,
                   e.g. $AttributeHash->{"newattr"} = \"newval"), delete
                   attributes, or modify attribute values in this hash, and
                   any changes you make will be incorporated into the output
                   HTML stream. Will be set to undef if the callback is made
                   due to a URL in a <style> tag or attribute.

           Return values
               DEFANG_NONE
                   The current URL will not be defanged.

               DEFANG_ALWAYS
                   The current URL will be defanged.

               DEFANG_DEFAULT
                   The current URL will be processed normally by HTML::Defang
                   as if there were no callback method specified.
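
           One way a URL callback might decide between these return values
           is to compare the URL's host against explicit lists. The sketch
           below is an assumption-laden illustration, not the module's own
           logic: ExampleUrlCallback, both host lists and the host-extracting
           regex are hypothetical, the constants are local stand-ins with
           assumed values, and the regex is far from a complete URL parser.

```perl
use strict;
use warnings;

# Stand-in constants (assumed values) so the sketch is self-contained.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hypothetical host lists.
my %AllowedHosts = map { $_ => 1 } qw(safesite.com www.safesite.com);
my %BlockedHosts = map { $_ => 1 } qw(evilsite.com);

sub ExampleUrlCallback {
    my ($Context, $Defang, $lcTag, $lcAttrKey, $AttrValR, $AttributeHash, $HtmlR) = @_;

    # Extract the host of an http(s) URL, skipping any user:password@ part.
    # URLs that do not match are left to the default processing.
    my ($Host) = $$AttrValR =~ m{^https?://(?:[^/?#@]*@)?([^/:?#@]+)}i
        or return DEFANG_DEFAULT;
    $Host = lc $Host;

    return DEFANG_NONE   if $AllowedHosts{$Host};
    return DEFANG_ALWAYS if $BlockedHosts{$Host};
    return DEFANG_DEFAULT;
}

# Direct calls with dummy arguments to show the three outcomes.
for my $Url ('http://safesite.com/a', 'http://evilsite.com/', 'http://safesite.com.evil.example/') {
    print "$Url => ", ExampleUrlCallback(undef, undef, 'a', 'href', \$Url, undef, undef), "\n";
}
```

           Comparing the whole host avoids a pitfall of unanchored substring
           matches: a pattern such as /safesite\.com/ also matches
           'safesite.com.evil.example'.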

       css_callback($context, $Defang, $Selectors, $SelectorRules, $lcTag,
       $IsAttr, $OutR)
           If $Defang->{css_callback} exists, and HTML::Defang has parsed a
           <style> tag or style attribute, the above callback is made to the
           client code. The return value of this method determines whether a
           particular declaration in the style rules is defanged or not, as
           described below.

           Method parameters
               $Selectors
                   Reference to an array containing the selectors in a style
                   tag or attribute.

               $SelectorRules
                   Reference to an array containing the style declaration
                   blocks of all selectors in a style tag or attribute.
                   Consider the CSS below:

                     a { b:c; d:e}
                     j { k:l; m:n}

                   The declaration blocks will get parsed into the following
                   data structure:

                     [
                       [
                         [ "b", "c", DEFANG_DEFAULT ],
                         [ "d", "e", DEFANG_DEFAULT ]
                       ],
                       [
                         [ "k", "l", DEFANG_DEFAULT ],
                         [ "m", "n", DEFANG_DEFAULT ]
                       ]
                     ]

                   So, generally each property:value pair in a declaration is
                   parsed into an array of the form

                     ["property", "value", X]

                   where X can be DEFANG_NONE, DEFANG_ALWAYS or
                   DEFANG_DEFAULT, with DEFANG_DEFAULT as the default value.
                   A client can change this value to instruct HTML::Defang
                   how to treat this property:value pair.

                   DEFANG_NONE - Do not defang

                   DEFANG_ALWAYS - Defang the property:value pair

                   DEFANG_DEFAULT - Process this as if there is no callback
                   specified

               $IsAttr
                   True if the currently processed item is a style attribute.
                   False if the currently processed item is a style tag.
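
           The structure above can be walked as in the following sketch. It
           builds the data by hand rather than calling HTML::Defang, uses
           local stand-in constants with assumed values, and follows the
           two-level layout shown in this section; if the module actually
           hands over a deeper nesting, as the loops in the synopsis suggest,
           one more loop level would be needed.

```perl
use strict;
use warnings;

# Stand-in constants (assumed values) so the sketch is self-contained.
use constant { DEFANG_NONE => 0, DEFANG_ALWAYS => 1, DEFANG_DEFAULT => 2 };

# Hand-built data mirroring the documented layout for:
#   a { position:fixed; color:red }
#   j { margin:0 !important }
my $Selectors     = [ 'a', 'j' ];
my $SelectorRules = [
    [ [ 'position', 'fixed', DEFANG_DEFAULT ], [ 'color', 'red', DEFANG_DEFAULT ] ],
    [ [ 'margin', '0 !important', DEFANG_DEFAULT ] ],
];

# Walk selectors and declaration blocks in parallel and set the third
# element of any [property, value, X] triple that should be defanged.
for my $i (0 .. $#$Selectors) {
    for my $Rule (@{ $SelectorRules->[$i] }) {
        my ($Property, $Value) = @$Rule;
        $Rule->[2] = DEFANG_ALWAYS
            if $Value =~ /!important/i
            || ($Property eq 'position' && $Value eq 'fixed');
    }
}

# Collect the flagged declarations for display.
my @Flagged = map  { "$_->[0]:$_->[1]" }
              grep { $_->[2] == DEFANG_ALWAYS }
              map  { @$_ } @$SelectorRules;
print join(', ', @Flagged), "\n";
```

           Note that the decision is communicated by editing the triples in
           place; declarations left at DEFANG_DEFAULT are handled as if no
           callback had looked at them.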

METHODS
       PUBLIC METHODS
           defang($InputHtml, \%Opts)
               Cleans up $InputHtml of any executable code including
               scripting, embedded objects, applets, etc., and defangs any
               XSS attacks.

               Method parameters
                   $InputHtml
                       The input HTML string that needs to be sanitized.

               Returns the cleaned HTML. If fix_mismatched_tags is set, any
               tags that appear in @$mismatched_tags_to_fix that are
               unbalanced are automatically commented or closed.

           add_to_output($String)
               Appends $String to the output after the current parsed tag
               ends. Can be used by client code in callback methods to add
               HTML text to the processed output. If the HTML text needs to be
               defanged, client code can safely call HTML::Defang->defang()
               recursively from within the callback.

               Method parameters
                   $String
                       The string that is added after the current parsed tag
                       ends.

       INTERNAL METHODS
           Generally these methods never need to be called by users of the
           class, because they'll be called internally as the appropriate tags
           are encountered, but they may be useful for some users in some
           cases.

           defang_script_tag($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
           $Tag, $TagTrail, $Attributes, $CloseAngle)
               This method is invoked when a <script> tag is parsed. Defangs
               the <script> opening tag, and any closing tag. Any scripting
               content is also commented out, so browsers don't display it.

               Returns 1 to indicate that the <script> tag must be defanged.

               Method parameters
                   $OutR
                       A reference to the processed output HTML before the tag
                       that is currently being parsed.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $TagOps
                       Indicates what operation should be done on a tag. Can
                       be undefined, an integer or a code reference. Undefined
                       indicates a tag unknown to HTML::Defang, 1 indicates a
                       known safe tag, 0 indicates a known unsafe tag, and a
                       code reference indicates a subroutine that should be
                       called to parse the current tag. For example, <style>
                       and <script> tags are parsed by dedicated subroutines.

                   $OpenAngle
                       Opening angle bracket (<) of the current tag.

                   $IsEndTag
                       Has the value '/' if the current tag is a closing tag.

                   $Tag
                       The HTML tag that is currently being parsed.

                   $TagTrail
                       Any space after the tag, but before attributes.

                   $Attributes
                       A reference to an array of the attributes and their
                       values, including any surrounding spaces. Each element
                       of the array is added by 'push' calls like the one
                       below.

                         push @$Attributes, [ $AttributeName, $SpaceBeforeEquals, $EqualsAndSubsequentSpace, $QuoteChar, $AttributeValue, $QuoteChar, $SpaceAfterAtributeValue ];

                   $CloseAngle
                       Anything after the end of the last attribute, including
                       the closing HTML angle bracket (>).
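
               To make the seven-slot layout of $Attributes concrete, the
               sketch below builds such an array by hand for a hypothetical
               tag and reassembles the attribute text from it. It is plain
               Perl and does not call HTML::Defang; that joining the slots in
               order reproduces the original text is an inference from the
               push call above, not a documented guarantee.

```perl
use strict;
use warnings;

# Hypothetical parsed attributes for:  href = "x.html" title='Hi'
# Slots: name, space before '=', '=' plus following space, opening quote,
# value, closing quote, space after the value.
my $Attributes = [
    [ 'href',  ' ', '= ', '"', 'x.html', '"', ' ' ],
    [ 'title', '',  '=',  "'", 'Hi',     "'", ''  ],
];

# Joining the slots in order gives back the attribute text,
# whitespace and quoting included.
my $Rebuilt = join '', map { join '', @$_ } @$Attributes;

# The attribute name and value sit at indexes 0 and 4 respectively.
my %ByName = map { $_->[0] => $_->[4] } @$Attributes;

print "$Rebuilt\n";
```

               Keeping the whitespace and quote characters as separate slots
               means a tag can be written back out unchanged apart from the
               pieces that were deliberately edited.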

           defang_style_text($Content, $lcTag, $IsAttr, $AttributeHash,
           $HtmlR, $OutR)
               Defangs some raw CSS data and returns the defanged content.

               Method parameters
                   $Content
                       The input style string that is defanged.

                   $IsAttr
                       True if $Content is from an attribute, otherwise from a
                       <style> block.

           cleanup_style($StyleString)
               Helper function to clean up CSS data. This function directly
               operates on the input string without taking a copy.

               Method parameters
                   $StyleString
                       The input style string that is cleaned.

           defang_stylerule($SelectorsIn, $StyleRules, $lcTag, $IsAttr,
           $AttributeHash, $HtmlR, $OutR)
               Defangs style data.

               Method parameters
                   $SelectorsIn
                       An array reference to the selectors in the style
                       tag/attribute contents.

                   $StyleRules
                       An array reference to the declaration blocks in the
                       style tag/attribute contents.

                   $lcTag
                       Lower case version of the HTML tag that is currently
                       being parsed.

                   $IsAttr
                       Whether we are currently parsing a style attribute or
                       style tag. $IsAttr will be true if we are currently
                       parsing a style attribute.

                   $HtmlR
                       A scalar reference to the input HTML.

                   $OutR
                       A scalar reference to the processed output so far.

           defang_attributes($OutR, $HtmlR, $TagOps, $OpenAngle, $IsEndTag,
           $Tag, $TagTrail, $Attributes, $CloseAngle)
               Defangs attributes and tags, and makes the tag, attribute, CSS
               and URL callbacks.

               Method parameters
                   For a description of the method parameters, see the
                   documentation of the defang_script_tag() method.

           cleanup_attribute($AttributeString)
               Helper function to clean up attributes.

               Method parameters
                   $AttributeString
                       The value of the attribute.

SEE ALSO
       <http://mailtools.anomy.net/>, <http://htmlcleaner.sourceforge.net/>,
       HTML::StripScripts, HTML::Detoxifier, HTML::Sanitizer, HTML::Scrubber

AUTHOR
       Kurian Jose Aerthail <cpan@kurianja.fastmail.fm>. Thanks to Rob Mueller
       <cpan@robm.fastmail.fm> for initial code, guidance and support and bug
       fixes.

COPYRIGHT AND LICENSE
       Copyright (C) 2003-2013 by FastMail Pty Ltd

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.30.0                      2019-07-26                   HTML::Defang(3)