HTML::StripScripts(3)  User Contributed Perl Documentation  HTML::StripScripts(3)

NAME

       HTML::StripScripts - Strip scripting constructs out of HTML

SYNOPSIS

         use HTML::StripScripts;

         my $hss = HTML::StripScripts->new({ Context => 'Inline' });

         $hss->input_start_document;

         $hss->input_start('<i>');
         $hss->input_text('hello, world!');
         $hss->input_end('</i>');

         $hss->input_end_document;

         print $hss->filtered_document;

DESCRIPTION

       This module strips scripting constructs out of HTML, leaving as much
       non-scripting markup in place as possible.  This allows web
       applications to display HTML originating from an untrusted source
       without introducing XSS (cross-site scripting) vulnerabilities.

       You will probably use HTML::StripScripts::Parser rather than using this
       module directly.

       The process is based on whitelists of tags, attributes and attribute
       values.  This approach is the most secure against disguised scripting
       constructs hidden in malicious HTML documents.

       As well as removing scripting constructs, this module ensures that
       there is a matching end for each start tag, and that the tags are
       properly nested.

       Previously, in order to customise the output, you needed to subclass
       "HTML::StripScripts" and override methods.  Now, most customisation can
       be done through the "Rules" option provided to new(). (See
       examples/declaration/ and examples/tags/ for cases where subclassing is
       necessary.)

       The HTML document must be parsed into start tags, end tags and text
       before it can be filtered by this module.  Use either
       HTML::StripScripts::Parser or HTML::StripScripts::Regex instead if you
       want to input an unparsed HTML document.

       See examples/direct/ for an example of how to feed tokens directly to
       HTML::StripScripts.

CONSTRUCTORS

       new ( CONFIG )
           Creates a new "HTML::StripScripts" filter object, bound to a
           particular filtering policy.  If present, the CONFIG parameter must
           be a hashref.  The following keys are recognized (unrecognized keys
           will be silently ignored).

               $s = HTML::StripScripts->new({
                   Context         => 'Document|Flow|Inline|NoTags',
                   BanList         => [qw( br img )] | {br => '1', img => '1'},
                   BanAllBut       => [qw(p div span)],
                   AllowSrc        => 0|1,
                   AllowHref       => 0|1,
                   AllowRelURL     => 0|1,
                   AllowMailto     => 0|1,
                   EscapeFiltered  => 0|1,
                   Rules           => { See below for details },
               });

           "Context"
               A string specifying the context in which the filtered document
               will be used.  This influences the set of tags that will be
               allowed.

               If present, the "Context" value must be one of:

               "Document"
                   If "Context" is "Document" then the filter will allow a
                   full HTML document, including the "HTML" tag and "HEAD" and
                   "BODY" sections.

               "Flow"
                   If "Context" is "Flow" then most of the cosmetic tags that
                   one would expect to find in a document body are allowed,
                   including lists and tables but not including forms.

               "Inline"
                   If "Context" is "Inline" then only inline tags such as "B"
                   and "FONT" are allowed.

               "NoTags"
                   If "Context" is "NoTags" then no tags are allowed.

               The default "Context" value is "Flow".

           "BanList"
               If present, this option must be an arrayref or a hashref.  Any
               tag that would normally be allowed (because it presents no XSS
               hazard) will be blocked if the lowercase name of the tag is in
               this list.

               For example, in a guestbook application where "HR" tags are
               used to separate posts, you may wish to prevent posts from
               including "HR" tags, even though "HR" is not an XSS risk.

           "BanAllBut"
               If present, this option must be a reference to an array holding
               a list of lowercase tag names.  This has the effect of adding
               all but the listed tags to the ban list, so that only those
               tags listed will be allowed.

           "AllowSrc"
               By default, the filter won't allow constructs that cause the
               browser to fetch things automatically, such as "SRC" attributes
               in "IMG" tags.  If this option is present and true then those
               constructs will be allowed.

           "AllowHref"
               By default, the filter won't allow constructs that cause the
               browser to fetch things if the user clicks on something, such
               as the "HREF" attribute in "A" tags.  Set this option to a true
               value to allow this type of construct.

           "AllowRelURL"
               By default, the filter won't allow relative URLs such as
               "../foo.html" in "SRC" and "HREF" attribute values.  Set this
               option to a true value to allow them. "AllowHref" and / or
               "AllowSrc" also need to be set to true for this to have any
               effect.

           "AllowMailto"
               By default, "mailto:" links are not allowed. If "AllowMailto"
               is set to a true value, then this construct will be allowed.
               This can be enabled separately from AllowHref.

           "EscapeFiltered"
               By default, any filtered tags are output as
               "<!--filtered-->". If "EscapeFiltered" is set to a true value,
               then the filtered tags are converted to HTML entities.

               For instance:

                 <br>  -->  &lt;br&gt;
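
               As an illustration, the guestbook scenario mentioned under
               "BanList" could be configured roughly as follows.  This is a
               sketch only: the option values are assumptions chosen for the
               example, and the constructor call is left as a comment so that
               the fragment has no dependencies.

```perl
use strict;
use warnings;

# Hypothetical policy for a guestbook: block-level markup is fine,
# links are allowed, images stay disabled, and <hr> is reserved for
# the application itself.
my %config = (
    Context        => 'Flow',        # the default, spelled out for clarity
    BanList        => [qw( hr )],    # harmless tag, banned for layout reasons
    AllowHref      => 1,             # permit href attributes
    AllowRelURL    => 1,             # only has an effect because AllowHref is set
    AllowMailto    => 0,             # keep mailto: links disabled
    EscapeFiltered => 1,             # show rejected tags as escaped text
);

# With the module loaded, the filter would be created like this:
#   my $hss = HTML::StripScripts->new( \%config );
```

               Keeping the policy in a named hash makes it easy to reuse
               the same settings for several filter objects.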

           "Rules"
               The "Rules" option provides a very flexible way of customising
               the filter.

               The focus is safety-first, so it is applied after all of the
               previous validation.  This means that all malicious data should
               already have been cleared by the time your rules are applied.

               Rules can be specified for tags and for attributes. Any tag or
               attribute not explicitly listed will be handled by the default
               "*" rules.

               The following is a synopsis of all of the options that you can
               use to configure rules.  Below, an example is broken into
               sections and explained.

                Rules => {

                    tag => 0 | 1 | sub { tag_callback }
                           | {
                               attr      => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                               '*'       => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                               required  => [qw(attrname attrname)],
                               tag       => sub { tag_callback }
                             },

                    '*' => 0 | 1 | sub { tag_callback }
                           | {
                               attr => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                               '*'  => 0 | 1 | 'regex' | qr/regex/ | sub { attr_callback },
                               tag  => sub { tag_callback }
                             }

                }

               EXAMPLE:

                   Rules => {

                       ##########################
                       ##### EXPLICIT RULES #####
                       ##########################

                       ## Allow <br> tags, reject <img> tags
                       br          => 1,
                       img         => 0,

                       ## Send all <div> tags to a sub
                       div         => sub { tag_callback },

                       ## Allow <blockquote> tags, and allow the 'cite' attribute
                       ## All other attributes are handled by the default '*'
                       blockquote  => {
                           cite    => 1,
                       },

                       ## Allow <a> tags, and
                       a  => {

                           ## Allow the 'title' attribute
                           title     => 1,

                           ## Allow the 'href' attribute if it matches the regex
                           href      => '^http://yourdomain.com',
                           ## OR
                           ## href   => qr{^http://yourdomain.com},

                           ## 'style' attributes are handled by a sub
                           style     => sub { attr_callback },

                           ## All other attributes are rejected
                           '*'       => 0,

                           ## Additionally, the <a> tag should be handled by this sub
                           tag       => sub { tag_callback },

                           ## If the <a> tag doesn't have these attributes, filter the tag
                           required  => [qw(href title)],

                       },

                       #########################
                       ##### DEFAULT RULES #####
                       #########################

                       ## The default '*' rule - accepts all the same options as above.
                       ## If a tag or attribute is not mentioned above, then the default
                       ## rule is applied.  Use ONE of the following forms:

                       ## Reject all tags
                       '*'         => 0,

                       ## Allow all tags and all attributes
                       '*'         => 1,

                       ## Send all tags to the sub
                       '*'         => sub { tag_callback },

                       ## Allow all tags, reject all attributes
                       '*'         => { '*'  => 0 },

                       ## Allow all tags, and
                       '*' => {

                           ## Allow the 'title' attribute
                           title   => 1,

                           ## Allow the 'href' attribute if it matches the regex
                           href    => '^http://yourdomain.com',
                           ## OR
                           ## href => qr{^http://yourdomain.com},

                           ## 'style' attributes are handled by a sub
                           style   => sub { attr_callback },

                           ## All other attributes are rejected
                           '*'     => 0,

                           ## Additionally, all tags should be handled by this sub
                           tag     => sub { tag_callback },

                       },

                   }
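
               The annotated example is schematic.  A condensed version that
               compiles as ordinary Perl might look like the sketch below.
               The callback "keep_tag" is a hypothetical placeholder, and the
               dot in the domain is escaped here so that it matches only a
               literal dot; the constructor call is shown as a comment.

```perl
use strict;
use warnings;

# Hypothetical placeholder; a real callback would inspect or
# modify $element before deciding.
sub keep_tag {
    my ( $filter, $element ) = @_;
    return 1;
}

my $rules = {
    br  => 1,             # allow <br>
    img => 0,             # reject <img>
    div => \&keep_tag,    # let a sub decide
    a   => {
        title    => 1,
        href     => qr{^http://yourdomain\.com},
        '*'      => 0,                     # reject every other attribute
        required => [qw( href title )],    # filter <a> without these
    },
    '*' => 0,             # default rule: reject tags not listed above
};

# With the module loaded, the rules would be passed in as:
#   my $hss = HTML::StripScripts->new({ Rules => $rules });
```

               Only one default "*" rule appears here, since a Perl hash
               keeps a single value per key.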

               Tag Callbacks
                       sub tag_callback {
                           my ($filter, $element) = @_;

                           ## $element is a hashref of the form:
                           ##   {
                           ##       tag      => 'tag',
                           ##       content  => 'inner_html',
                           ##       attr     => {
                           ##           attr_name => 'attr_value',
                           ##       }
                           ##   }

                           return 1;    ## or 0 to drop the tag and its content
                       }

                   A tag callback accepts two parameters, the $filter object
                   and the $element.  It should return 0 to completely
                   ignore the tag and its content (which includes any nested
                   HTML tags), or 1 to accept and output the tag.

                   The $element is a hash ref containing the keys:

               "tag"
                   This is the tag name in lowercase, e.g. "a", "br", "img".
                   If you set the tag value to an empty string, then the tag
                   will not be output, but the tag contents will.

               "content"
                   This is the equivalent of DOM's innerHTML. It contains the
                   text content and any HTML tags contained within this
                   element. You can change the content or set it to an empty
                   string so that it is not output.

               "attr"
                   "attr" contains a hashref containing the attribute names
                   and values.

               If, for instance, you wanted to replace "<b>" tags with
               "<span>" tags, you could do this:

                   sub b_callback {
                       my ($filter,$element)   = @_;
                       $element->{tag}         = 'span';
                       $element->{attr}{style} = 'font-weight:bold';
                       return 1;
                   }
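
               Two further callbacks, sketched under the same contract, show
               the other documented effects: emptying "tag" to keep only the
               content, and emptying "content" to keep only the tag.  The
               sub names and the 500-character limit are hypothetical.

```perl
use strict;
use warnings;

# Drop the <font> markup itself but keep whatever was inside it.
sub font_callback {
    my ( $filter, $element ) = @_;
    $element->{tag} = '';    # empty tag name: only the content is output
    return 1;
}

# Keep <blockquote> tags but discard an oversized body.
sub blockquote_callback {
    my ( $filter, $element ) = @_;
    my $content = defined $element->{content} ? $element->{content} : '';
    if ( length $content > 500 ) {    # arbitrary limit for the example
        $element->{content} = '';
    }
    return 1;
}
```

               They would be attached through "Rules", for example as
               "font => \&font_callback".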

           Attribute Callbacks
                   sub attr_callback {
                       my ( $filter, $tag, $attr_name, $attr_val ) = @_;
                       return $attr_val;    ## or undef, or ''
                   }

               Attribute callbacks accept four parameters, the $filter object,
               the $tag name, the $attr_name and the $attr_val.

               A callback should return either "undef" to reject the
               attribute, or the value to be used. An empty string keeps the
               attribute, but without a value.
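
               A small attribute callback following this contract is
               sketched below.  The name "title_callback" and the 80
               character limit are assumptions made for the example.

```perl
use strict;
use warnings;

# Trim a title attribute; reject it when empty or overly long.
sub title_callback {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    $attr_val =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
    return undef if $attr_val eq '';           # reject empty titles
    return undef if length $attr_val > 80;     # arbitrary cap for the example
    return $attr_val;                          # keep the trimmed value
}
```

               It could be attached to every tag with a rule such as
               "'*' => { title => \&title_callback }".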

           "BanList" vs "BanAllBut" vs "Rules"
               It is not necessary to use "BanList" or "BanAllBut";
               everything can be done via "Rules".  However, it may be
               simpler to write:

                   BanAllBut => [qw(p div span)]

               The logic works as follows:

                  * If BanAllBut exists, then ban everything but the tags in the list
                  * Add to the ban list any elements in BanList
                  * Any tags mentioned explicitly in Rules (eg a => 0, br => 1)
                    are added to or removed from the ban list
                  * A default rule of { '*' => 0 } would ban all tags except
                    those mentioned in Rules
                  * A default rule of { '*' => 1 } would allow all tags except
                    those disallowed in the ban list, or by explicit rules

METHODS

       This class provides the following methods:

       hss_init ()
           This method is called by new() and does the actual initialisation
           work for the new HTML::StripScripts object.

       input_start_document ()
           This method initializes the filter, and must be called once before
           starting on each HTML document to be filtered.

       input_start ( TEXT )
           Handles a start tag from the input document.  TEXT must be the full
           text of the tag, including angle-brackets.

       input_end ( TEXT )
           Handles an end tag from the input document.  TEXT must be the full
           text of the end tag, including angle-brackets.

       input_text ( TEXT )
           Handles some non-tag text from the input document.

       input_process ( TEXT )
           Handles a processing instruction from the input document.

       input_comment ( TEXT )
           Handles an HTML comment from the input document.

       input_declaration ( TEXT )
           Handles a declaration from the input document.

       input_end_document ()
           Call this method to signal the end of the input document.

       filtered_document ()
           Returns the filtered document as a string.
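
       The call sequence can be wrapped in a small helper.  The sketch
       below assumes the document has already been tokenised into
       [ type, text ] pairs; the helper name "filter_tokens" and the type
       names are hypothetical.  It uses only the methods listed above.

```perl
use strict;
use warnings;

# Map hypothetical token types to the input_* methods.
my %method_for = (
    start       => 'input_start',
    end         => 'input_end',
    text        => 'input_text',
    process     => 'input_process',
    comment     => 'input_comment',
    declaration => 'input_declaration',
);

# $filter is any object providing the input_* methods and
# filtered_document(); @tokens is a list of [ type, text ] pairs.
sub filter_tokens {
    my ( $filter, @tokens ) = @_;
    $filter->input_start_document;    # once per document
    for my $token (@tokens) {
        my ( $type, $text ) = @$token;
        my $method = $method_for{$type}
            or die "unknown token type '$type'";
        $filter->$method($text);
    }
    $filter->input_end_document;
    return $filter->filtered_document;
}
```

       With the module loaded, a call might look like
       "filter_tokens( HTML::StripScripts->new, [ start => '<i>' ],
       [ text => 'hello' ], [ end => '</i>' ] )".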

SUBCLASSING

       The only reason for subclassing this module now is to add to the list
       of accepted tags, attributes and styles (See "WHITELIST INITIALIZATION
       METHODS").  Everything else can be achieved with "Rules".

       The "HTML::StripScripts" class is subclassable.  Filter objects are
       plain hashes and "HTML::StripScripts" reserves only hash keys that
       start with "_hss".  The filter configuration can be set up by invoking
       the hss_init() method, which takes the same arguments as new().

OUTPUT METHODS

       The filter outputs a stream of start tags, end tags, text, comments,
       declarations and processing instructions, via the following "output_*"
       methods.  Subclasses may override these to intercept the filter output.

       The default implementations of the "output_*" methods pass the text on
       to the output() method.  The default implementation of the output()
       method appends the text to a string, which can be fetched with the
       filtered_document() method once processing is complete.

       If the output() method or the individual "output_*" methods are
       overridden in a subclass, then filtered_document() will not work in
       that subclass.

       output_start_document ()
           This method gets called once at the start of each HTML document
           passed through the filter.  The default implementation does
           nothing.

       output_end_document ()
           This method gets called once at the end of each HTML document
           passed through the filter.  The default implementation does
           nothing.

       output_start ( TEXT )
           This method is used to output a filtered start tag.

       output_end ( TEXT )
           This method is used to output a filtered end tag.

       output_text ( TEXT )
           This method is used to output some filtered non-tag text.

       output_declaration ( TEXT )
           This method is used to output a filtered declaration.

       output_comment ( TEXT )
           This method is used to output a filtered HTML comment.

       output_process ( TEXT )
           This method is used to output a filtered processing instruction.

       output ( TEXT )
           This method is invoked by all of the default "output_*" methods.
           The default implementation appends the text to the string that the
           filtered_document() method will return.

       output_stack_entry ( TEXT )
           This method is invoked when a tag plus all text and nested HTML
           content within the tag has been processed. It adds the tag plus its
           content to the content for its parent tag.
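
       As a sketch of intercepting the output, the hypothetical subclass
       below overrides output() to write to a filehandle.  The class name
       and the "my_fh" key are assumptions; the key merely avoids the
       reserved "_hss" prefix.  How such an override interacts with tag
       callbacks, which rely on accumulated content, is not covered here
       and should be tested before use.

```perl
use strict;
use warnings;

package My::StreamingFilter;

# -norequire keeps this fragment free of dependencies; real code
# would load HTML::StripScripts first (or drop -norequire).
use parent -norequire, 'HTML::StripScripts';

# Send each piece of filtered output to a filehandle instead of
# appending it to the string returned by filtered_document().
sub output {
    my ( $self, $text ) = @_;
    print { $self->{my_fh} } $text;
    return;
}

package main;
```

       As noted above, filtered_document() will not work in such a
       subclass; the caller reads the result from the filehandle instead.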

REJECT METHODS

       When the filter encounters something in the input document which it
       cannot transform into an acceptable construct, it invokes one of the
       following "reject_*" methods to put something in the output document to
       take the place of the unacceptable construct.

       The TEXT parameter is the full text of the unacceptable construct.

       The default implementations of these methods output an HTML comment
       containing the text "filtered". If "EscapeFiltered" is set to true,
       then the rejected text is HTML-escaped instead.

       Subclasses may override these methods, but should exercise caution.
       The TEXT parameter is unfiltered input and may contain malicious
       constructs.

       reject_start ( TEXT )
       reject_end ( TEXT )
       reject_text ( TEXT )
       reject_declaration ( TEXT )
       reject_comment ( TEXT )
       reject_process ( TEXT )

WHITELIST INITIALIZATION METHODS

       The filter refers to various whitelists to determine which constructs
       are acceptable.  To modify these whitelists, subclasses can override
       the following methods.

       Each method is called once at object initialization time, and must
       return a reference to a nested data structure.  These references are
       installed into the object, and used whenever the filter needs to refer
       to a whitelist.

       The default implementations of these methods can be invoked as class
       methods.

       See examples/tags/ and examples/declaration/ for examples of how to
       override these methods.

       init_context_whitelist ()
           Returns a reference to the "Context" whitelist, which determines
           which tags may appear at each point in the document, and which
           other tags may be nested within them.

           It is a hash, and the keys are context names, such as "Flow" and
           "Inline".

           The values in the hash are hashrefs.  The keys in these subhashes
           are lowercase tag names, and the values are context names,
           specifying the context that the tag provides to any other tags
           nested within it.

           The special context "EMPTY" as a value in a subhash indicates that
           nothing can be nested within that tag.
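
           To make the shape concrete, the sketch below builds a toy
           whitelist in the documented form and derives a copy with one
           extra tag.  The helper "with_extra_tag", the toy data and the
           added "mark" tag are all hypothetical; the real whitelist is
           considerably larger.

```perl
use strict;
use warnings;

# Return a copy of a context whitelist with $tag allowed in $context.
# $nested names the context that $tag offers to its own children
# ('EMPTY' if nothing may be nested inside it).
sub with_extra_tag {
    my ( $whitelist, $context, $tag, $nested ) = @_;
    # Copy both levels so the original structure is left untouched.
    my %copy = map { $_ => { %{ $whitelist->{$_} } } } keys %$whitelist;
    $copy{$context}{$tag} = $nested;
    return \%copy;
}

# Toy whitelist: context name => { tag name => context for children }.
my $toy = {
    Flow   => { p => 'Inline', br => 'EMPTY' },
    Inline => { b => 'Inline', br => 'EMPTY' },
};

my $extended = with_extra_tag( $toy, 'Inline', 'mark', 'Inline' );
```

           In a subclass, an overriding init_context_whitelist() could
           apply such a helper to the result of
           "$self->SUPER::init_context_whitelist".  Copying first is a
           precaution in case the default structure is shared.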

       init_attrib_whitelist ()
           Returns a reference to the "Attrib" whitelist, which determines
           which attributes each tag can have and the values that those
           attributes can take.

           It is a hash, and the keys are lowercase tag names.

           The values in the hash are hashrefs.  The keys in these subhashes
           are lowercase attribute names, and the values are attribute value
           class names, which are short strings describing the type of values
           that the attribute can take, such as "color" or "number".

       init_attval_whitelist ()
           Returns a reference to the "AttVal" whitelist, which is a hash that
           maps attribute value class names from the "Attrib" whitelist to
           coderefs to subs to validate (and optionally transform) a
           particular attribute value.

           The filter calls the attribute value validation subs with the
           following parameters:

           "filter"
               A reference to the filter object.

           "tagname"
               The lowercase name of the tag in which the attribute appears.

           "attrname"
               The name of the attribute.

           "attrval"
               The attribute value found in the input document, in canonical
               form (see "CANONICAL FORM").

           The validation sub can return undef to indicate that the attribute
           should be removed from the tag, or it can return the new value for
           the attribute, in canonical form.
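
           A validation sub with this signature could look like the
           sketch below.  The "percent" value class and the sub name are
           hypothetical and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical handler for a "percent" value class: accepts 0 to 100
# with an optional trailing percent sign and returns a normalised value.
sub attval_percent {
    my ( $filter, $tagname, $attrname, $attrval ) = @_;
    return undef unless $attrval =~ /^\s*(\d{1,3})\s*%?\s*$/;
    my $number = $1 + 0;                 # drop leading zeros
    return undef if $number > 100;       # out of range: remove the attribute
    return "$number%";
}
```

           A subclass could register it by adding an entry such as
           "percent => \&attval_percent" to the hash returned by its
           init_attval_whitelist() override.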

       init_style_whitelist ()
           Returns a reference to the "Style" whitelist, which determines
           which CSS style directives are permitted in "style" tag attributes.
           The keys are value names such as "color" and "background-color",
           and the values are class names to be used as keys into the "AttVal"
           whitelist.

       init_deinter_whitelist ()
           Returns a reference to the "DeInter" whitelist, which determines
           which inline tags the filter should attempt to automatically
           de-interleave if they are encountered interleaved.  For example,
           the filter will transform:

             <b>hello <i>world</b> !</i>

           Into:

             <b>hello <i>world</i></b><i> !</i>

           because both "b" and "i" appear as keys in the "DeInter" whitelist.

CHARACTER DATA PROCESSING

       These methods transform attribute values and non-tag text from the
       input document into canonical form (see "CANONICAL FORM"), and
       transform text in canonical form into a suitable form for the output
       document.

       text_to_canonical_form ( TEXT )
           This method is used to reduce non-tag text from the input document
           to canonical form before passing it to the filter_text() method.

           The default implementation unescapes all entities that map to
           "US-ASCII" characters other than ampersand, and replaces any
           ampersands that don't form part of valid entities with "&amp;".

       quoted_to_canonical_form ( VALUE )
           This method is used to reduce attribute values quoted with
           doublequotes or singlequotes to canonical form before passing it to
           the handler subs in the "AttVal" whitelist.

           The default behavior is the same as that of
           text_to_canonical_form(), plus it converts any CR, LF or TAB
           characters to spaces.

       unquoted_to_canonical_form ( VALUE )
           This method is used to reduce attribute values without quotes to
           canonical form before passing it to the handler subs in the
           "AttVal" whitelist.

           The default implementation simply replaces all ampersands with
           "&amp;", since that corresponds with the way most browsers treat
           entities in unquoted values.
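
           The two simplest steps described above can be sketched in a
           few lines.  These subs re-implement the documented behaviour
           for illustration only; they are not the module's own code, and
           their names are hypothetical.

```perl
use strict;
use warnings;

# Unquoted values: every ampersand is treated as a literal ampersand.
sub sketch_unquoted_to_canonical {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    return $value;
}

# The extra step applied to quoted values: CR, LF and TAB each
# become a space.
sub sketch_flatten_whitespace {
    my ($value) = @_;
    $value =~ tr/\r\n\t/ /;
    return $value;
}
```

           Note that the first sub escapes the ampersand of an existing
           entity as well, which matches the "all ampersands" wording
           above.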

       canonical_form_to_text ( TEXT )
           This method is used to convert the text in canonical form returned
           by the filter_text() method to a form suitable for inclusion in the
           output document.

           The default implementation runs anything that doesn't look like a
           valid entity through the escape_html_metachars() method.

       canonical_form_to_attval ( ATTVAL )
           This method is used to convert the text in canonical form returned
           by the "AttVal" handler subs to a form suitable for inclusion in
           doublequotes in the output tag.

           The default implementation converts CR, LF and TAB characters to a
           single space, and runs anything that doesn't look like a valid
           entity through the escape_html_metachars() method.

       validate_href_attribute ( TEXT )
           If the "AllowHref" filter configuration option is set, then this
           method is used to validate "href" type attribute values.  TEXT is
           the attribute value in canonical form.  Returns a possibly modified
           attribute value (in canonical form) or "undef" to reject the
           attribute.

           The default implementation allows only absolute "http" and "https"
           URLs, permits port numbers and query strings, and imposes
           reasonable length limits.

           It does not URI escape the query string, and it does not guarantee
           properly formatted URIs, it just tries to give safe URIs. You can
           always use an attribute callback (see "Attribute Callbacks") to
           provide stricter handling.
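
           One possible form of such stricter handling is sketched below:
           an attribute callback that accepts only "https" links to a
           fixed set of hosts.  The host list and the sub name are
           hypothetical.

```perl
use strict;
use warnings;

# Hypothetical list of hosts that links may point to.
my %trusted_host = map { $_ => 1 } qw( example.com www.example.com );

sub strict_href_callback {
    my ( $filter, $tag, $attr_name, $attr_val ) = @_;
    # The host must be followed by a path, query, fragment or the end
    # of the value, so ports and user@host forms are rejected.
    my ($host) = $attr_val =~ m{^https://([A-Za-z0-9.-]+)(?:[/?\#]|\z)};
    return undef unless defined $host && $trusted_host{ lc $host };
    return $attr_val;
}
```

           It would be attached with a rule such as
           "a => { href => \&strict_href_callback }".  Since rules are
           applied after the built-in validation, "AllowHref" presumably
           still needs to be set for any "href" value to reach the
           callback.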

       validate_mailto ( TEXT )
           If the "AllowMailto" filter configuration option is set, then this
           method is used to validate "href" type attribute values which begin
           with "mailto:".  TEXT is the attribute value in canonical form.
           Returns a possibly modified attribute value (in canonical form) or
           "undef" to reject the attribute.

           This uses a lightweight regex and does not guarantee that email
           addresses are properly formatted. You can always use an attribute
           callback (see "Attribute Callbacks") to provide stricter handling.

       validate_src_attribute ( TEXT )
           If the "AllowSrc" filter configuration option is set, then this
           method is used to validate "src" type attribute values.  TEXT is
           the attribute value in canonical form.  Returns a possibly modified
           attribute value (in canonical form) or "undef" to reject the
           attribute.

           The default implementation behaves as validate_href_attribute().

OTHER METHODS TO OVERRIDE

       As well as the output, reject, init and cdata methods listed above, it
       might make sense for subclasses to override the following methods:

       filter_text ( TEXT )
           This method will be invoked to filter blocks of non-tag text in the
           input document.  Both input and output are in canonical form, see
           "CANONICAL FORM".

           The default implementation does no filtering.
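
           A minimal override is sketched below.  The class name and the
           placeholder word are hypothetical; the replacement text
           contains no ampersands, so the result stays in canonical form.

```perl
use strict;
use warnings;

package My::CensoringFilter;

# -norequire keeps this fragment free of dependencies; real code
# would load HTML::StripScripts first (or drop -norequire).
use parent -norequire, 'HTML::StripScripts';

# Mask a placeholder word in every block of non-tag text.
sub filter_text {
    my ( $self, $text ) = @_;
    $text =~ s/\bfoo\b/***/gi;
    return $text;
}

package main;
```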

       escape_html_metachars ( TEXT )
           This method is used to escape all HTML metacharacters in TEXT.  The
           return value must be a copy of TEXT with metacharacters escaped.

           The default implementation escapes a minimal set of metacharacters
           for security against XSS vulnerabilities.  The set of characters to
           escape is a compromise between the need for security and the need
           to ensure that the filter will work for documents in as many
           different character sets as possible.

           Subclasses which make strong assumptions about the document
           character set will be able to escape much more aggressively.

       strip_nonprintable ( TEXT )
           Returns a copy of TEXT with runs of nonprintable characters
           replaced with spaces or some other harmless string.  Avoids
           replacing anything with the empty string, as that can lead to other
           security issues.

           The default implementation strips out only NULL characters, in
           order to avoid scrambling text for as many different character sets
           as possible.

           Subclasses which make some sort of assumption about the character
           set in use will be able to have a much wider definition of a
           nonprintable character, and hence a more secure
           strip_nonprintable() implementation.
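
           For instance, a subclass that assumes ASCII-compatible input
           might widen the definition as sketched below.  The class name
           and the chosen character ranges are assumptions made for this
           example.

```perl
use strict;
use warnings;

package My::AsciiFilter;

# -norequire keeps this fragment free of dependencies; real code
# would load HTML::StripScripts first (or drop -norequire).
use parent -norequire, 'HTML::StripScripts';

# Replace each run of C0 control characters (except TAB, LF and CR)
# and DEL with a single space, never with the empty string.
sub strip_nonprintable {
    my ( $self, $text ) = @_;
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/ /g;
    return $text;
}

package main;
```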

ATTRIBUTE VALUE HANDLER SUBS

684       References to the following subs appear in the "AttVal" whitelist
685       returned by the init_attval_whitelist() method.
686
687       _hss_attval_style( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
688           Attribute value hander for the "style" attribute.
689
690       _hss_attval_size ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
691           Attribute value handler for attributes who's values are some sort
692           of size or length.
693
694       _hss_attval_number ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
695           Attribute value handler for attributes who's values are a simple
696           integer.
697
698       _hss_attval_color ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
699           Attribute value handler for color attributes.
700
701       _hss_attval_text ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
702           Attribute value handler for text attributes.
703
704       _hss_attval_word ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
705           Attribute value handler for attributes whose values must consist of
706           a single short word, with minus characters permitted.
707
708       _hss_attval_wordlist ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
709           Attribute value handler for attributes whose values must consist of
710           one or more words, separated by spaces and/or commas.
711
712       _hss_attval_wordlistq ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
713           Attribute value handler for attributes whose values must consist of
714           one or more words, separated by commas, with optional doublequotes
715           around words and spaces allowed within the doublequotes.
716
717       _hss_attval_href ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
718           Attribute value handler for "href" type attributes.  If the
719           "AllowHref" or "AllowMailto" configuration options are set, uses
720           the validate_href_attribute() method to check the attribute value.
721
722       _hss_attval_src ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
723           Attribute value handler for "src" type attributes.  If the
724           "AllowSrc" configuration option is set, uses the
725           validate_src_attribute() method to check the attribute value.
726
727       _hss_attval_stylesrc ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
728           Attribute value handler for "src" type style pseudo attributes.
729
730       _hss_attval_novalue ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
731           Attribute value handler for attributes that have no value or a
732           value that is ignored.  Just returns the attribute name as the
733           value.
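
       All of the handlers above share the same four-argument signature.  The
       following is a minimal sketch of a handler in that style, not the
       module's own code.  The name my_attval_word, the 30 character limit,
       and the convention that undef means "reject the attribute" are
       assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical handler with the ( FILTER, TAGNAME, ATTRNAME, ATTRVAL )
# signature.  Assumption: a defined return value is the cleaned attribute
# value, and undef means the attribute is rejected.
sub my_attval_word {
    my ($filter, $tagname, $attrname, $attrval) = @_;
    return undef unless defined $attrval;

    # Accept a single short word made of word characters and minus signs,
    # ignoring surrounding whitespace.  The length limit is illustrative.
    return $attrval =~ /^\s*([\w-]{1,30})\s*$/ ? $1 : undef;
}

for my $value ('center', 'javascript:alert(1)') {
    my $result = my_attval_word(undef, 'p', 'align', $value);
    print "$value => ", (defined $result ? $result : '(rejected)'), "\n";
}
```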
734

CANONICAL FORM

736       Many of the methods described above deal with text from the input
737       document, encoded in what I call "canonical form", defined as follows:
738
739       All characters other than ampersands represent themselves.  Literal
740       ampersands are encoded as "&amp;".  Non "US-ASCII" characters may
741       appear as literals in whatever character set is in use, or they may
742       appear as named or numeric HTML entities such as "&aelig;", "&#31337;"
743       and "&#xFF;".  Unknown named entities such as "&foo;" may appear.
744
745       The idea is to be able to reduce input text to a minimal
746       form, without making too many assumptions about the character set in
747       use.
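
       To make the definition concrete, here is a simplified sketch of the
       ampersand rule only.  It is not the module's text_to_canonical_form()
       method: the helper name to_canonical_sketch is hypothetical, and the
       sketch does nothing beyond encoding ampersands that are not already
       part of a named or numeric entity.

```perl
use strict;
use warnings;

# Hypothetical helper: encode every ampersand that does not start a named
# (&name;), decimal (&#123;) or hexadecimal (&#xFF;) entity.
sub to_canonical_sketch {
    my ($text) = @_;
    $text =~ s/&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/&amp;/g;
    return $text;
}

print to_canonical_sketch('AT&T &amp; &foo; &#xFF; &#31337;'), "\n";
# prints: AT&amp;T &amp; &foo; &#xFF; &#31337;
```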
748

PRIVATE METHODS

750       The following methods are internal to this class, and should not be
751       invoked from elsewhere.  Subclasses should not use or override these
752       methods.
753
754       _hss_prepare_ban_list (CFG)
755           Returns a hash ref representing all the banned tags, based on the
756           values of BanList and BanAllBut.
757
758       _hss_prepare_rules (CFG)
759           Returns a hash ref representing the tag and attribute rules (See
760           "Rules").
761
762           Returns undef if no rules are specified, in which case the
763           attribute filter code has very little performance impact. If any
764           rules are specified, then every tag and attribute is checked.
765
766       _hss_get_attr_filter ( DEFAULT_FILTERS, TAG_FILTERS, ATTR_NAME )
767           Returns the attribute filter rule to apply to this particular
768           attribute.
769
770           Checks for:
771
772             - a named attribute rule in a named tag
773             - a default * attribute rule in a named tag
774             - a named attribute rule in the default * rules
775             - a default * attribute rule in the default * rules
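
           The lookup order above can be sketched as follows.  This is an
           illustration of the documented precedence rather than the module's
           own code: the function name get_attr_filter_sketch and the sample
           rule values are hypothetical, and both arguments are assumed to be
           hash refs keyed by attribute name.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the four-step lookup.
sub get_attr_filter_sketch {
    my ($default_filters, $tag_filters, $attr_name) = @_;

    # The named tag's rules are consulted before the default '*' rules, and
    # within each set the named attribute wins over the '*' attribute.
    for my $rules ($tag_filters, $default_filters) {
        next unless $rules;
        for my $key ($attr_name, '*') {
            return $rules->{$key} if exists $rules->{$key};
        }
    }
    return;    # no rule applies
}

my $default_filters = { class => 'default class rule', '*' => 'default * rule' };
my $tag_filters     = { href  => 'a href rule' };

print get_attr_filter_sketch($default_filters, $tag_filters, 'href'),    "\n";  # a href rule
print get_attr_filter_sketch($default_filters, $tag_filters, 'class'),   "\n";  # default class rule
print get_attr_filter_sketch($default_filters, $tag_filters, 'onclick'), "\n";  # default * rule
```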
776
777       _hss_join_attribs (FILTERED_ATTRIBS)
778           Accepts a hash ref containing the attribute names as the keys, and
779           the attribute values as the values.  Escapes them and returns a
780           string ready for output to HTML.
781
782       _hss_decode_numeric ( NUMERIC )
783           Returns the string that should replace the numeric entity NUMERIC
784           in the text_to_canonical_form() method.
785
786       _hss_tag_is_banned ( TAGNAME )
787           Returns true if the lower case tag name TAGNAME is on the list of
788           harmless tags that the filter is configured to block, false
789           otherwise.
790
791       _hss_get_to_valid_context ( TAG )
792           Tries to get the filter to a context in which the tag TAG is
793           allowed, by introducing extra end tags or start tags if necessary.
794           TAG can be either the lower case name of a tag or the string
795           'CDATA'.
796
797           Returns 1 if an allowed context is reached, or 0 if there's no
798           reasonable way to get to an allowed context and the tag should just
799           be rejected.
800
801       _hss_close_innermost_tag ()
802           Closes the innermost open tag.
803
804       _hss_context ()
805           Returns the current named context of the filter.
806
807       _hss_valid_in_context ( TAG, CONTEXT )
808           Returns true if the lowercase tag name TAG is valid in context
809           CONTEXT, false otherwise.
810
811       _hss_valid_in_current_context ( TAG )
812           Returns true if the lowercase tag name TAG is valid in the filter's
813           current context, false otherwise.
814

BUGS AND LIMITATIONS

816       Performance
817           This module does a lot of work to ensure that tags are correctly
818           nested and are not left open, causing unnecessary overhead for
819           applications where that doesn't matter.
820
821           Such applications may benefit from using the more lightweight
822           HTML::Scrubber::StripScripts module instead.
823
824       Strictness
825           URIs and email addresses are cleaned up to be safe, but not
826           necessarily accurate.  Stricter checking would have required adding
827           dependencies.  Attribute callbacks can be used to add this
828           functionality if required, or the validation methods can be
829           overridden.
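
           As a sketch of the callback approach, the sub below applies a
           stricter check to "href" values.  The name strict_href and the
           regular expression are hypothetical, and the example assumes that
           an attribute callback supplied through "Rules" receives the
           filter, tag name, attribute name and attribute value, and returns
           undef to drop the attribute.  The constructor call is shown only
           in a comment so that the sketch runs without the module installed.

```perl
use strict;
use warnings;

# Hypothetical callback: keep only plain http/https URLs with a simple host
# name, an optional port and an optional path.  Anything else is dropped.
sub strict_href {
    my ($filter, $tag, $attr_name, $attr_val) = @_;
    return undef unless defined $attr_val;
    return $attr_val =~ m{\Ahttps?://[A-Za-z0-9.-]+(?::\d+)?(?:/[^\s"'<>]*)?\z}
        ? $attr_val
        : undef;
}

# Assumed wiring, based on the "Rules" option described in this document:
#   my $hss = HTML::StripScripts->new({
#       AllowHref => 1,
#       Rules     => { a => { href => \&strict_href } },
#   });

for my $url ('http://example.com/page?x=1', 'javascript:alert(1)') {
    my $kept = strict_href(undef, 'a', 'href', $url);
    print(defined($kept) ? "kept: $kept\n" : "dropped: $url\n");
}
```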
830
831           By default, filtered HTML may not be valid strict XHTML; for
832           instance, empty required attributes may be output.  However, with
833           "Rules", it should be possible to force the HTML to validate.
834
835       REPORTING BUGS
836           Please report any bugs or feature requests to
837           bug-html-stripscripts@rt.cpan.org, or through the web interface at
838           <http://rt.cpan.org>.
839

SEE ALSO

841       HTML::Parser, HTML::StripScripts::Parser, HTML::StripScripts::Regex
842

AUTHOR

844       Original author Nick Cleaton <nick@cleaton.net>
845
846       New code added and module maintained by Clinton Gormley
847       <clint@traveljury.com>
848

COPYRIGHT

850       Copyright (C) 2003 Nick Cleaton.  All Rights Reserved.
851
852       Copyright (C) 2007 Clinton Gormley.  All Rights Reserved.
853

LICENSE

855       This module is free software; you can redistribute it and/or modify it
856       under the same terms as Perl itself.
857
858
859
860perl v5.36.1                      2023-06-07             HTML::StripScripts(3)