HTML::StripScripts(3)    User Contributed Perl Documentation    HTML::StripScripts(3)

6 HTML::StripScripts - Strip scripting constructs out of HTML
7
9 use HTML::StripScripts;
10
11 my $hss = HTML::StripScripts->new({ Context => 'Inline' });
12
13 $hss->input_start_document;
14
15 $hss->input_start('<i>');
16 $hss->input_text('hello, world!');
17 $hss->input_end('</i>');
18
19 $hss->input_end_document;
20
21 print $hss->filtered_document;
22
24 This module strips scripting constructs out of HTML, leaving as much
25 non-scripting markup in place as possible. This allows web
26 applications to display HTML originating from an untrusted source
27 without introducing XSS (cross site scripting) vulnerabilities.
28
29 You will probably use HTML::StripScripts::Parser rather than using this
30 module directly.
31
32 The process is based on whitelists of tags, attributes and attribute
33 values. This approach is the most secure against disguised scripting
34 constructs hidden in malicious HTML documents.
35
36 As well as removing scripting constructs, this module ensures that
37 there is a matching end for each start tag, and that the tags are
38 properly nested.
39
40 Previously, in order to customise the output, you needed to subclass
41 "HTML::StripScripts" and override methods. Now, most customisation can
42 be done through the "Rules" option provided to "new()". (See
43 examples/declaration/ and examples/tags/ for cases where subclassing is
44 necessary.)
45
46 The HTML document must be parsed into start tags, end tags and text
47 before it can be filtered by this module. Use either
48 HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
49 want to input an unparsed HTML document.
50
51 See examples/direct/ for an example of how to feed tokens directly to
52 HTML::StripScripts.
53
55 new ( CONFIG )
56 Creates a new "HTML::StripScripts" filter object, bound to a
57 particular filtering policy. If present, the CONFIG parameter must
58 be a hashref. The following keys are recognized (unrecognized keys
59 will be silently ignored).
60
$s = HTML::StripScripts->new({
62 Context => 'Document|Flow|Inline|NoTags',
63 BanList => [qw( br img )] | {br => '1', img => '1'},
64 BanAllBut => [qw(p div span)],
65 AllowSrc => 0|1,
66 AllowHref => 0|1,
67 AllowRelURL => 0|1,
68 AllowMailto => 0|1,
69 EscapeFiltered => 0|1,
70 Rules => { See below for details },
71 });
72
73 "Context"
74 A string specifying the context in which the filtered document
75 will be used. This influences the set of tags that will be
76 allowed.
77
78 If present, the "Context" value must be one of:
79
80 "Document"
81 If "Context" is "Document" then the filter will allow a
82 full HTML document, including the "HTML" tag and "HEAD" and
83 "BODY" sections.
84
85 "Flow"
86 If "Context" is "Flow" then most of the cosmetic tags that
87 one would expect to find in a document body are allowed,
88 including lists and tables but not including forms.
89
90 "Inline"
91 If "Context" is "Inline" then only inline tags such as "B"
92 and "FONT" are allowed.
93
94 "NoTags"
95 If "Context" is "NoTags" then no tags are allowed.
96
97 The default "Context" value is "Flow".
98
99 "BanList"
100 If present, this option must be an arrayref or a hashref. Any
101 tag that would normally be allowed (because it presents no XSS
102 hazard) will be blocked if the lowercase name of the tag is in
103 this list.
104
105 For example, in a guestbook application where "HR" tags are
106 used to separate posts, you may wish to prevent posts from
107 including "HR" tags, even though "HR" is not an XSS risk.
108
109 "BanAllBut"
If present, this option must be a reference to an array holding a
111 list of lowercase tag names. This has the effect of adding all
112 but the listed tags to the ban list, so that only those tags
113 listed will be allowed.
114
115 "AllowSrc"
116 By default, the filter won't allow constructs that cause the
117 browser to fetch things automatically, such as "SRC" attributes
118 in "IMG" tags. If this option is present and true then those
119 constructs will be allowed.
120
121 "AllowHref"
122 By default, the filter won't allow constructs that cause the
123 browser to fetch things if the user clicks on something, such
124 as the "HREF" attribute in "A" tags. Set this option to a true
125 value to allow this type of construct.
126
127 "AllowRelURL"
128 By default, the filter won't allow relative URLs such as
129 "../foo.html" in "SRC" and "HREF" attribute values. Set this
130 option to a true value to allow them. "AllowHref" and / or
131 "AllowSrc" also need to be set to true for this to have any
132 effect.
133
134 "AllowMailto"
135 By default, "mailto:" links are not allowed. If "AllowMailto"
136 is set to a true value, then this construct will be allowed.
137 This can be enabled separately from AllowHref.
138
139 "EscapeFiltered"
140 By default, any filtered tags are outputted as
141 "<!--filtered-->". If "EscapeFiltered" is set to a true value,
142 then the filtered tags are converted to HTML entities.
143
144 For instance:
145
<br> --> &lt;br&gt;
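
As a rough illustration of what "converted to HTML entities" means, the
following standalone sketch escapes the metacharacters of a rejected
tag. The helper name escape_rejected is hypothetical and is not part of
HTML::StripScripts; the module's own escaping may cover a different set
of characters.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the HTML metacharacters in a rejected tag
# so that a browser displays it as text instead of interpreting it.
sub escape_rejected {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

print escape_rejected('<br>'), "\n";    # prints "&lt;br&gt;"
```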
147
148 "Rules"
149 The "Rules" option provides a very flexible way of customising
150 the filter.
151
The focus is safety-first, so the rules are applied after all of the
previous validation. This means that all malicious data should
already have been cleared by the time a rule runs.
155
156 Rules can be specified for tags and for attributes. Any tag or
157 attribute not explicitly listed will be handled by the default
158 "*" rules.
159
160 The following is a synopsis of all of the options that you can
161 use to configure rules. Below, an example is broken into
162 sections and explained.
163
164 Rules => {
165
166 tag => 0 | 1 | sub { tag_callback }
167 | {
168 attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
169 '*' => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
170 required => [qw(attrname attrname)],
171 tag => sub { tag_callback }
172 },
173
174 '*' => 0 | 1 | sub { tag_callback }
175 | {
176 attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
177 '*' => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback},
178 tag => sub { tag_callback }
179 }
180
181 }
182
183 EXAMPLE:
184
185 Rules => {
186
187 ##########################
188 ##### EXPLICIT RULES #####
189 ##########################
190
191 ## Allow <br> tags, reject <img> tags
192 br => 1,
193 img => 0,
194
195 ## Send all <div> tags to a sub
196 div => sub { tag_callback },
197
198 ## Allow <blockquote> tags,and allow the 'cite' attribute
## All other attributes are handled by the default '*' rule
200 blockquote => {
201 cite => 1,
202 },
203
204 ## Allow <a> tags, and
205 a => {
206
207 ## Allow the 'title' attribute
208 title => 1,
209
210 ## Allow the 'href' attribute if it matches the regex
href => '^http://yourdomain.com',
## OR: href => qr{^http://yourdomain.com},
213
214 ## 'style' attributes are handled by a sub
215 style => sub { attr_callback },
216
217 ## All other attributes are rejected
218 '*' => 0,
219
220 ## Additionally, the <a> tag should be handled by this sub
221 tag => sub { tag_callback},
222
223 ## If the <a> tag doesn't have these attributes, filter the tag
224 required => [qw(href title)],
225
226 },
227
228 ##########################
229 ##### DEFAULT RULES #####
230 ##########################
231
232 ## The default '*' rule - accepts all the same options as above.
233 ## If a tag or attribute is not mentioned above, then the default
234 ## rule is applied:
235
236 ## Reject all tags
237 '*' => 0,
238
239 ## Allow all tags and all attributes
240 '*' => 1,
241
242 ## Send all tags to the sub
243 '*' => sub { tag_callback },
244
245 ## Allow all tags, reject all attributes
246 '*' => { '*' => 0 },
247
248 ## Allow all tags, and
249 '*' => {
250
251 ## Allow the 'title' attribute
252 title => 1,
253
254 ## Allow the 'href' attribute if it matches the regex
href => '^http://yourdomain.com',
## OR: href => qr{^http://yourdomain.com},
257
258 ## 'style' attributes are handled by a sub
259 style => sub { attr_callback },
260
261 ## All other attributes are rejected
262 '*' => 0,
263
264 ## Additionally, all tags should be handled by this sub
265 tag => sub { tag_callback},
266
267 },
268
269 Tag Callbacks
270 sub tag_callback {
271 my ($filter,$element) = (@_);
272
273 $element = {
274 tag => 'tag',
275 content => 'inner_html',
276 attr => {
277 attr_name => 'attr_value',
278 }
279 };
280 return 0 | 1;
281 }
282
283 A tag callback accepts two parameters, the $filter object
and the $element. It should return 0 to completely
285 ignore the tag and its content (which includes any nested
286 HTML tags), or 1 to accept and output the tag.
287
288 The $element is a hash ref containing the keys:
289
290 "tag"
291 This is the tagname in lowercase, eg "a", "br", "img". If
292 you set the tag value to an empty string, then the tag will
293 not be outputted, but the tag contents will.
294
295 "content"
296 This is the equivalent of DOM's innerHTML. It contains the
297 text content and any HTML tags contained within this
298 element. You can change the content or set it to an empty
299 string so that it is not outputted.
300
301 "attr"
"attr" is a hashref mapping the attribute names to their
values.
304
305 If for instance, you wanted to replace "<b>" tags with "<span>"
306 tags, you could do this:
307
308 sub b_callback {
309 my ($filter,$element) = @_;
310 $element->{tag} = 'span';
311 $element->{attr}{style} = 'font-weight:bold';
312 return 1;
313 }
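
The "tag" key can also be used to unwrap an element. The sketch below
is a hypothetical callback (the name font_callback and the sample
element are made up for illustration) that drops the tag itself but
keeps its content, relying on the empty-string behaviour described
above. The element hash is simulated here so that the snippet runs
without the filter.

```perl
use strict;
use warnings;

# Hypothetical tag callback: drop <font> markup but keep what it wraps.
sub font_callback {
    my ( $filter, $element ) = @_;
    $element->{tag} = '';    # empty tag name: emit the content only
    return 1;                # accept the (now unwrapped) element
}

# Simulate the hash ref that the filter would pass to the callback.
my $element = {
    tag     => 'font',
    content => 'hello',
    attr    => { size => '7' },
};
my $accepted = font_callback( undef, $element );
```

Such a callback would be registered with something like
Rules => { font => \&font_callback }.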
314
315 Attribute Callbacks
316 sub attr_callback {
317 my ( $filter, $tag, $attr_name, $attr_val ) = @_;
318 return undef | '' | 'value';
319 }
320
Attribute callbacks accept four parameters: the $filter object,
the $tag name, the $attr_name and the $attr_val.

A callback should return either "undef" to reject the attribute, or
the value to be used. An empty string keeps the attribute, but
without a value.
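
A minimal sketch of an attribute callback follows. The callback name
and the site https://example.com/ are assumptions chosen for
illustration; only the parameter order and the undef/value return
convention come from the description above.

```perl
use strict;
use warnings;

# Hypothetical attribute callback: keep an href only if it points at
# https://example.com/ (an assumed site), otherwise reject it.
sub href_callback {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return undef unless defined $attr_val;
    return $attr_val if $attr_val =~ m{\Ahttps://example\.com/}i;
    return undef;    # undef tells the filter to drop the attribute
}
```

It could then be attached to a tag rule, for example
Rules => { a => { href => \&href_callback } }.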
327
328 "BanList" vs "BanAllBut" vs "Rules"
It is not necessary to use "BanList" or "BanAllBut" -
everything can be done via "Rules". However, it may be simpler
to write:
332
333 BanAllBut => [qw(p div span)]
334
335 The logic works as follows:
336
337 * If BanAllBut exists, then ban everything but the tags in the list
338 * Add to the ban list any elements in BanList
339 * Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
340 are added or removed from the BanList
341 * A default rule of { '*' => 0 } would ban all tags except
342 those mentioned in Rules
343 * A default rule of { '*' => 1 } would allow all tags except
344 those disallowed in the ban list, or by explicit rules
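
The steps above can be modelled in a few lines of plain Perl. This is a
simplified sketch of the described logic, not the module's own code: the
function name allowed_tags is hypothetical, and the list of "known" tags
passed in stands in for whatever the filter would otherwise allow.

```perl
use strict;
use warnings;

# Simplified model of the ban logic. @known is an assumed list of tags
# that the filter would allow if nothing were banned.
sub allowed_tags {
    my ( $cfg, @known ) = @_;
    my %banned;

    # 1. BanAllBut bans everything except the listed tags.
    if ( my $keep = $cfg->{BanAllBut} ) {
        my %keep = map { $_ => 1 } @$keep;
        $banned{$_} = 1 for grep { !$keep{$_} } @known;
    }

    # 2. BanList adds further tags (arrayref or hashref).
    my $list = $cfg->{BanList} || [];
    $banned{$_} = 1 for ( ref($list) eq 'HASH' ? keys %$list : @$list );

    # 3. Explicit tag rules add to or remove from the ban list.
    my $rules = $cfg->{Rules} || {};
    for my $tag ( grep { $_ ne '*' } keys %$rules ) {
        if   ( $rules->{$tag} ) { delete $banned{$tag} }
        else                    { $banned{$tag} = 1 }
    }

    # 4. A default rule of '*' => 0 bans every tag without its own rule.
    if ( exists $rules->{'*'} && !$rules->{'*'} ) {
        $banned{$_} = 1 for grep { !exists $rules->{$_} } @known;
    }

    return sort grep { !$banned{$_} } @known;
}

my @tags = allowed_tags(
    { BanAllBut => [qw(p div span)], BanList => ['div'], Rules => { br => 1 } },
    qw(a b br div p span),
);
print "@tags\n";    # prints "br p span"
```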
345
347 This class provides the following methods:
348
349 hss_init ()
350 This method is called by new() and does the actual initialisation
351 work for the new HTML::StripScripts object.
352
353 input_start_document ()
354 This method initializes the filter, and must be called once before
355 starting on each HTML document to be filtered.
356
357 input_start ( TEXT )
358 Handles a start tag from the input document. TEXT must be the full
359 text of the tag, including angle-brackets.
360
361 input_end ( TEXT )
362 Handles an end tag from the input document. TEXT must be the full
363 text of the end tag, including angle-brackets.
364
365 input_text ( TEXT )
366 Handles some non-tag text from the input document.
367
368 input_process ( TEXT )
369 Handles a processing instruction from the input document.
370
371 input_comment ( TEXT )
372 Handles an HTML comment from the input document.
373
374 input_declaration ( TEXT )
Handles a declaration from the input document.
376
377 input_end_document ()
378 Call this method to signal the end of the input document.
379
380 filtered_document ()
381 Returns the filtered document as a string.
382
384 The only reason for subclassing this module now is to add to the list
385 of accepted tags, attributes and styles (See "WHITELIST INITIALIZATION
386 METHODS"). Everything else can be achieved with "Rules".
387
388 The "HTML::StripScripts" class is subclassable. Filter objects are
389 plain hashes and "HTML::StripScripts" reserves only hash keys that
390 start with "_hss". The filter configuration can be set up by invoking
391 the hss_init() method, which takes the same arguments as new().
392
394 The filter outputs a stream of start tags, end tags, text, comments,
395 declarations and processing instructions, via the following "output_*"
396 methods. Subclasses may override these to intercept the filter output.
397
398 The default implementations of the "output_*" methods pass the text on
399 to the output() method. The default implementation of the output()
400 method appends the text to a string, which can be fetched with the
401 filtered_document() method once processing is complete.
402
403 If the output() method or the individual "output_*" methods are
404 overridden in a subclass, then filtered_document() will not work in
405 that subclass.
406
407 output_start_document ()
408 This method gets called once at the start of each HTML document
409 passed through the filter. The default implementation does
410 nothing.
411
412 output_end_document ()
413 This method gets called once at the end of each HTML document
414 passed through the filter. The default implementation does
415 nothing.
416
417 output_start ( TEXT )
418 This method is used to output a filtered start tag.
419
420 output_end ( TEXT )
421 This method is used to output a filtered end tag.
422
423 output_text ( TEXT )
424 This method is used to output some filtered non-tag text.
425
426 output_declaration ( TEXT )
427 This method is used to output a filtered declaration.
428
429 output_comment ( TEXT )
430 This method is used to output a filtered HTML comment.
431
432 output_process ( TEXT )
433 This method is used to output a filtered processing instruction.
434
435 output ( TEXT )
436 This method is invoked by all of the default "output_*" methods.
437 The default implementation appends the text to the string that the
438 filtered_document() method will return.
439
440 output_stack_entry ( TEXT )
441 This method is invoked when a tag plus all text and nested HTML
442 content within the tag has been processed. It adds the tag plus its
443 content to the content for its parent tag.
444
446 When the filter encounters something in the input document which it
447 cannot transform into an acceptable construct, it invokes one of the
448 following "reject_*" methods to put something in the output document to
449 take the place of the unacceptable construct.
450
451 The TEXT parameter is the full text of the unacceptable construct.
452
453 The default implementations of these methods output an HTML comment
454 containing the text "filtered". If "EscapeFiltered" is set to true,
455 then the rejected text is HTML escaped instead.
456
457 Subclasses may override these methods, but should exercise caution.
458 The TEXT parameter is unfiltered input and may contain malicious
459 constructs.
460
461 reject_start ( TEXT )
462 reject_end ( TEXT )
463 reject_text ( TEXT )
464 reject_declaration ( TEXT )
465 reject_comment ( TEXT )
466 reject_process ( TEXT )
467
469 The filter refers to various whitelists to determine which constructs
470 are acceptable. To modify these whitelists, subclasses can override
471 the following methods.
472
473 Each method is called once at object initialization time, and must
474 return a reference to a nested data structure. These references are
475 installed into the object, and used whenever the filter needs to refer
476 to a whitelist.
477
478 The default implementations of these methods can be invoked as class
479 methods.
480
481 See examples/tags/ and examples/declaration/ for examples of how to
482 override these methods.
483
484 init_context_whitelist ()
485 Returns a reference to the "Context" whitelist, which determines
486 which tags may appear at each point in the document, and which
487 other tags may be nested within them.
488
489 It is a hash, and the keys are context names, such as "Flow" and
490 "Inline".
491
492 The values in the hash are hashrefs. The keys in these subhashes
493 are lowercase tag names, and the values are context names,
494 specifying the context that the tag provides to any other tags
495 nested within it.
496
497 The special context "EMPTY" as a value in a subhash indicates that
498 nothing can be nested within that tag.
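
To make the shape of this structure concrete, here is an illustrative
fragment. The entries are assumptions chosen for the example and do not
reproduce the module's real whitelist; the helper child_context is
likewise hypothetical.

```perl
use strict;
use warnings;

# Illustrative fragment only: context name => { tag => child context }.
my %whitelist = (
    Flow   => { div => 'Flow', p => 'Inline', b => 'Inline', br => 'EMPTY' },
    Inline => { b   => 'Inline', i => 'Inline', br => 'EMPTY' },
);

# Returns the context that $tag provides to its children, or undef if
# $tag is not allowed in $context at all.
sub child_context {
    my ( $context, $tag ) = @_;
    my $tags = $whitelist{$context} or return undef;
    return $tags->{$tag};
}
```

In this fragment a "p" inside "Flow" gives its children the "Inline"
context, while "br" maps to "EMPTY" and so cannot contain anything.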
499
500 init_attrib_whitelist ()
501 Returns a reference to the "Attrib" whitelist, which determines
502 which attributes each tag can have and the values that those
503 attributes can take.
504
505 It is a hash, and the keys are lowercase tag names.
506
507 The values in the hash are hashrefs. The keys in these subhashes
508 are lowercase attribute names, and the values are attribute value
509 class names, which are short strings describing the type of values
510 that the attribute can take, such as "color" or "number".
511
512 init_attval_whitelist ()
513 Returns a reference to the "AttVal" whitelist, which is a hash that
514 maps attribute value class names from the "Attrib" whitelist to
515 coderefs to subs to validate (and optionally transform) a
516 particular attribute value.
517
518 The filter calls the attribute value validation subs with the
519 following parameters:
520
521 "filter"
522 A reference to the filter object.
523
524 "tagname"
525 The lowercase name of the tag in which the attribute appears.
526
527 "attrname"
528 The name of the attribute.
529
530 "attrval"
531 The attribute value found in the input document, in canonical
532 form (see "CANONICAL FORM").
533
534 The validation sub can return undef to indicate that the attribute
535 should be removed from the tag, or it can return the new value for
536 the attribute, in canonical form.
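
A validation sub following this calling convention might look like the
sketch below. The "percent" value class and the name attval_percent are
made up for illustration; they are not among the handlers shipped with
the module.

```perl
use strict;
use warnings;

# Hypothetical validator for a made-up "percent" value class. It takes
# the documented parameters: filter, tag name, attribute name and the
# attribute value in canonical form.
sub attval_percent {
    my ( $filter, $tagname, $attrname, $attrval ) = @_;
    return undef unless defined $attrval;

    # Accept 0% to 100%, tolerate surrounding whitespace, and return a
    # normalised value.
    return "$1%"
        if $attrval =~ /\A\s*([0-9]{1,3})\s*%\s*\z/ && $1 <= 100;

    return undef;    # remove the attribute from the tag
}
```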
537
538 init_style_whitelist ()
539 Returns a reference to the "Style" whitelist, which determines
540 which CSS style directives are permitted in "style" tag attributes.
541 The keys are value names such as "color" and "background-color",
542 and the values are class names to be used as keys into the "AttVal"
543 whitelist.
544
init_deinter_whitelist ()
Returns a reference to the "DeInter" whitelist, which determines
which inline tags the filter should attempt to automatically
de-interleave if they are encountered interleaved. For example, the
549 filter will transform:
550
551 <b>hello <i>world</b> !</i>
552
553 Into:
554
555 <b>hello <i>world</i></b><i> !</i>
556
557 because both "b" and "i" appear as keys in the "DeInter" whitelist.
558
560 These methods transform attribute values and non-tag text from the
561 input document into canonical form (see "CANONICAL FORM"), and
562 transform text in canonical form into a suitable form for the output
563 document.
564
565 text_to_canonical_form ( TEXT )
566 This method is used to reduce non-tag text from the input document
567 to canonical form before passing it to the filter_text() method.
568
569 The default implementation unescapes all entities that map to
570 "US-ASCII" characters other than ampersand, and replaces any
571 ampersands that don't form part of valid entities with "&".
572
573 quoted_to_canonical_form ( VALUE )
574 This method is used to reduce attribute values quoted with
575 doublequotes or singlequotes to canonical form before passing it to
576 the handler subs in the "AttVal" whitelist.
577
578 The default behavior is the same as that of
579 "text_to_canonical_form()", plus it converts any CR, LF or TAB
580 characters to spaces.
581
582 unquoted_to_canonical_form ( VALUE )
583 This method is used to reduce attribute values without quotes to
584 canonical form before passing it to the handler subs in the
585 "AttVal" whitelist.
586
587 The default implementation simply replaces all ampersands with
588 "&", since that corresponds with the way most browsers treat
589 entities in unquoted values.
590
591 canonical_form_to_text ( TEXT )
592 This method is used to convert the text in canonical form returned
593 by the filter_text() method to a form suitable for inclusion in the
594 output document.
595
596 The default implementation runs anything that doesn't look like a
597 valid entity through the escape_html_metachars() method.
598
599 canonical_form_to_attval ( ATTVAL )
600 This method is used to convert the text in canonical form returned
601 by the "AttVal" handler subs to a form suitable for inclusion in
602 doublequotes in the output tag.
603
604 The default implementation converts CR, LF and TAB characters to a
605 single space, and runs anything that doesn't look like a valid
606 entity through the escape_html_metachars() method.
607
608 validate_href_attribute ( TEXT )
609 If the "AllowHref" filter configuration option is set, then this
610 method is used to validate "href" type attribute values. TEXT is
611 the attribute value in canonical form. Returns a possibly modified
612 attribute value (in canonical form) or "undef" to reject the
613 attribute.
614
615 The default implementation allows only absolute "http" and "https"
616 URLs, permits port numbers and query strings, and imposes
617 reasonable length limits.
618
It does not URI escape the query string, and it does not guarantee
properly formatted URIs; it just tries to give safe URIs. You can
always use an attribute callback (see "Attribute Callbacks") to
provide stricter handling.
623
624 validate_mailto ( TEXT )
625 If the "AllowMailto" filter configuration option is set, then this
626 method is used to validate "href" type attribute values which begin
627 with "mailto:". TEXT is the attribute value in canonical form.
628 Returns a possibly modified attribute value (in canonical form) or
629 "undef" to reject the attribute.
630
631 This uses a lightweight regex and does not guarantee that email
632 addresses are properly formatted. You can always use an attribute
633 callback (see "Attribute Callbacks") to provide stricter handling.
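
As an example of such stricter handling, the following attribute
callback accepts only bare mailto addresses. The callback name and the
policy (no "?subject=" style parameters, a conservative address
pattern) are assumptions made for this sketch, and the pattern is not a
full address validator.

```perl
use strict;
use warnings;

# Hypothetical attribute callback for href values.
sub mailto_callback {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    return undef unless defined $attr_val;

    # Leave non-mailto values alone; they were validated earlier.
    return $attr_val unless $attr_val =~ /\Amailto:/i;

    # Assumed policy: a bare address only, with no query part.
    return $attr_val
        if $attr_val =~ /\Amailto:[A-Za-z0-9._+-]+\@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\z/i;

    return undef;    # reject anything else that starts with mailto:
}
```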
634
635 validate_src_attribute ( TEXT )
636 If the "AllowSrc" filter configuration option is set, then this
637 method is used to validate "src" type attribute values. TEXT is
638 the attribute value in canonical form. Returns a possibly modified
639 attribute value (in canonical form) or "undef" to reject the
640 attribute.
641
642 The default implementation behaves as validate_href_attribute().
643
645 As well as the output, reject, init and cdata methods listed above, it
646 might make sense for subclasses to override the following methods:
647
648 filter_text ( TEXT )
649 This method will be invoked to filter blocks of non-tag text in the
650 input document. Both input and output are in canonical form, see
651 "CANONICAL FORM".
652
653 The default implementation does no filtering.
654
655 escape_html_metachars ( TEXT )
656 This method is used to escape all HTML metacharacters in TEXT. The
657 return value must be a copy of TEXT with metacharacters escaped.
658
659 The default implementation escapes a minimal set of metacharacters
660 for security against XSS vulnerabilities. The set of characters to
661 escape is a compromise between the need for security and the need
662 to ensure that the filter will work for documents in as many
663 different character sets as possible.
664
665 Subclasses which make strong assumptions about the document
666 character set will be able to escape much more aggressively.
667
668 strip_nonprintable ( TEXT )
669 Returns a copy of TEXT with runs of nonprintable characters
670 replaced with spaces or some other harmless string. Avoids
671 replacing anything with the empty string, as that can lead to other
672 security issues.
673
674 The default implementation strips out only NULL characters, in
675 order to avoid scrambling text for as many different character sets
676 as possible.
677
678 Subclasses which make some sort of assumption about the character
679 set in use will be able to have a much wider definition of a
680 nonprintable character, and hence a more secure
681 strip_nonprintable() implementation.
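
A stricter replacement might look like the sketch below, which assumes
an ASCII-compatible character set. It is written as a plain function
with a hypothetical name so that it runs on its own; in a subclass the
same substitution could form the body of a strip_nonprintable()
override.

```perl
use strict;
use warnings;

# Sketch under the assumption of an ASCII-compatible character set:
# each run of C0 control characters (other than TAB, LF and CR) or DEL
# becomes a single space, never the empty string.
sub strip_control_chars {
    my ($text) = @_;
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/ /g;
    return $text;
}
```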
682
684 References to the following subs appear in the "AttVal" whitelist
685 returned by the init_attval_whitelist() method.
686
687 _hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for the "style" attribute.
689
690 _hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes whose values are some sort
692 of size or length.
693
694 _hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes whose values are a simple
696 integer.
697
698 _hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
699 Attribute value handler for color attributes.
700
701 _hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
702 Attribute value handler for text attributes.
703
704 _hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes whose values must consist of
706 a single short word, with minus characters permitted.
707
708 _hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes whose values must consist of
710 one or more words, separated by spaces and/or commas.
711
712 _hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
Attribute value handler for attributes whose values must consist of
714 one or more words, separated by commas, with optional doublequotes
715 around words and spaces allowed within the doublequotes.
716
717 _hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
718 Attribute value handler for "href" type attributes. If the
719 "AllowHref" or "AllowMailto" configuration options are set, uses
720 the validate_href_attribute() method to check the attribute value.
721
722 _hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
723 Attribute value handler for "src" type attributes. If the
724 "AllowSrc" configuration option is set, uses the
725 validate_src_attribute() method to check the attribute value.
726
727 _hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
728 Attribute value handler for "src" type style pseudo attributes.
729
730 _hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
731 Attribute value handler for attributes that have no value or a
732 value that is ignored. Just returns the attribute name as the
733 value.
734
736 Many of the methods described above deal with text from the input
737 document, encoded in what I call "canonical form", defined as follows:
738
All characters other than ampersands represent themselves. Literal
ampersands are encoded as "&amp;". Non "US-ASCII" characters may
appear as literals in whatever character set is in use, or they may
appear as named or numeric HTML entities such as "&aelig;", "&#31337;"
and "&#xFF;". Unknown named entities such as "&foo;" may appear.
744
The idea is to be able to reduce input text to a minimal
746 form, without making too many assumptions about the character set in
747 use.
748
750 The following methods are internal to this class, and should not be
751 invoked from elsewhere. Subclasses should not use or override these
752 methods.
753
754 _hss_prepare_ban_list (CFG)
755 Returns a hash ref representing all the banned tags, based on the
756 values of BanList and BanAllBut
757
758 _hss_prepare_rules (CFG)
759 Returns a hash ref representing the tag and attribute rules (See
760 "Rules").
761
762 Returns undef if no filters are specified, in which case the
763 attribute filter code has very little performance impact. If any
764 rules are specified, then every tag and attribute is checked.
765
_hss_get_attr_filter ( DEFAULT_FILTERS, TAG_FILTERS, ATTR_NAME )
767 Returns the attribute filter rule to apply to this particular
768 attribute.
769
770 Checks for:
771
772 - a named attribute rule in a named tag
773 - a default * attribute rule in a named tag
774 - a named attribute rule in the default * rules
775 - a default * attribute rule in the default * rules
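
The lookup order above can be sketched as a small standalone function.
This is an illustration of the precedence only, with a hypothetical
name; it is not the private method itself and should not be used to
replace it.

```perl
use strict;
use warnings;

# Sketch of the precedence: rules for the named tag are consulted
# first, then the default '*' rules. Within each set, a rule for the
# named attribute wins over that set's own '*' rule.
sub lookup_attr_rule {
    my ( $default_rules, $tag_rules, $attr_name ) = @_;
    for my $rules ( $tag_rules, $default_rules ) {
        next unless ref($rules) eq 'HASH';
        return $rules->{$attr_name} if exists $rules->{$attr_name};
        return $rules->{'*'}        if exists $rules->{'*'};
    }
    return undef;    # no rule applies
}

my $default = { title => 1, '*' => 0 };
my $a_rules = { href => '^http://yourdomain.com' };
```

With these sample rules, "href" resolves to the tag's own regex,
"title" falls through to the default set and is allowed, and any other
attribute hits the default '*' rule and is rejected.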
776
777 _hss_join_attribs (FILTERED_ATTRIBS)
778 Accepts a hash ref containing the attribute names as the keys, and
779 the attribute values as the values. Escapes them and returns a
780 string ready for output to HTML
781
782 _hss_decode_numeric ( NUMERIC )
783 Returns the string that should replace the numeric entity NUMERIC
784 in the text_to_canonical_form() method.
785
786 _hss_tag_is_banned ( TAGNAME )
787 Returns true if the lower case tag name TAGNAME is on the list of
788 harmless tags that the filter is configured to block, false
789 otherwise.
790
791 _hss_get_to_valid_context ( TAG )
792 Tries to get the filter to a context in which the tag TAG is
793 allowed, by introducing extra end tags or start tags if necessary.
794 TAG can be either the lower case name of a tag or the string
795 'CDATA'.
796
797 Returns 1 if an allowed context is reached, or 0 if there's no
798 reasonable way to get to an allowed context and the tag should just
799 be rejected.
800
801 _hss_close_innermost_tag ()
802 Closes the innermost open tag.
803
804 _hss_context ()
805 Returns the current named context of the filter.
806
807 _hss_valid_in_context ( TAG, CONTEXT )
808 Returns true if the lowercase tag name TAG is valid in context
809 CONTEXT, false otherwise.
810
811 _hss_valid_in_current_context ( TAG )
812 Returns true if the lowercase tag name TAG is valid in the filter's
813 current context, false otherwise.
814
816 Performance
817 This module does a lot of work to ensure that tags are correctly
818 nested and are not left open, causing unnecessary overhead for
819 applications where that doesn't matter.
820
821 Such applications may benefit from using the more lightweight
822 HTML::Scrubber::StripScripts module instead.
823
824 Strictness
825 URIs and email addresses are cleaned up to be safe, but not
826 necessarily accurate. That would have required adding
827 dependencies. Attribute callbacks can be used to add this
828 functionality if required, or the validation methods can be
829 overridden.
830
831 By default, filtered HTML may not be valid strict XHTML, for
832 instance empty required attributes may be outputted. However, with
833 "Rules", it should be possible to force the HTML to validate.
834
835 REPORTING BUGS
836 Please report any bugs or feature requests to
837 bug-html-stripscripts@rt.cpan.org, or through the web interface at
838 <http://rt.cpan.org>.
839
841 HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex
842
844 Original author Nick Cleaton <nick@cleaton.net>
845
846 New code added and module maintained by Clinton Gormley
847 <clint@traveljury.com>
848
850 Copyright (C) 2003 Nick Cleaton. All Rights Reserved.
851
852 Copyright (C) 2007 Clinton Gormley. All Rights Reserved.
853
855 This module is free software; you can redistribute it and/or modify it
856 under the same terms as Perl itself.
857

perl v5.30.1                      2020-01-30             HTML::StripScripts(3)