HTML::Strip(3pm)

Strip(3)              User Contributed Perl Documentation             Strip(3)

NAME

       HTML::Strip - Perl extension for stripping HTML markup from text.

SYNOPSIS

         use HTML::Strip;

         my $hs = HTML::Strip->new();

         # $raw_html holds the markup to be stripped
         my $clean_text = $hs->parse( $raw_html );

         # reset the parser state before parsing an unrelated document
         $hs->eof;

DESCRIPTION

       This module simply strips HTML-like markup from text rapidly and
       brutally.  It could easily be used to strip XML or SGML markup instead,
       but as removing HTML is a much more common problem, this module lives
       in the HTML:: namespace.

       It is written in XS, and thus about five times quicker than using
       regular expressions for the same task.

       It does not do any syntax checking (if you want that, use
       HTML::Parser); instead it merely applies the following rules:

       1.  Anything that looks like a tag, or a group of tags, will be replaced
           with a single space character.  Tags are considered to be anything
           that starts with a "<" and ends with a ">", with the caveat that a
           ">" character may appear in either of the following without ending
           the tag:

           Quote
               Quotes are considered to start with either a "'" or a """
               character, and end with a matching character not preceded by an
               odd number of escaping backslashes (i.e. "\"" does not end the
               quote but "\\\\"" does).

           Comment
               If the tag starts with an exclamation mark, it is assumed to be
               a declaration or a comment.  Within such tags, ">" characters
               do not end the tag if they appear within pairs of double dashes
               (e.g. "<!-- <a href="old.htm">old page</a> -->" would be
               stripped completely).  No parsing for quotes is performed
               within comments, so for instance "<!-- comment with both '
               quote types " -->" would be entirely stripped.

       2.  Anything that appears within what we term strip tags is stripped as
           well.  By default, these tags are "title", "script", "style" and
           "applet".

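       The two rules above can be tried out with a short script.  This is a
       minimal sketch, not a definitive test: the sample markup and variable
       names are invented for illustration, and the exact spacing of the
       result is deliberately not shown.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Hypothetical sample input: a tag with a ">" inside a quoted attribute,
# a comment that wraps markup, and a default strip tag (script).
my $raw_html = join '',
    '<p class="a > b">visible text</p>',
    '<!-- <a href="old.htm">old page</a> -->',
    '<script>var hidden = 1;</script>';

my $clean_text = $hs->parse($raw_html);
$hs->eof;

# Per the rules above, only "visible text" and some spacing should remain.
print "[$clean_text]\n";
```

       Printing the result between brackets makes any emitted spaces easy to
       see.
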
       HTML::Strip maintains state between calls, so you can parse a document
       in chunks should you wish.  If one chunk ends half-way through a tag,
       quote, or comment, the parser remembers this and expects the next call
       to parse to start with the remainder of that construct.

       If this is not going to be the case, be sure to call $hs->eof() between
       calls to $hs->parse().  Alternatively, you may set "auto_reset" to
       true in the constructor, or at any time afterwards with
       "set_auto_reset", so that the parser always operates on a one-shot
       basis (resetting after each parsed chunk).

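       A rough sketch of both modes follows.  The chunk boundaries and
       variable names are assumptions made up for this example; it only
       illustrates that state carries over between calls unless it is reset.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Hypothetical chunks: the first one ends in the middle of a tag.
my @chunks = ( 'Hello <b cla', 'ss="x">world</b>' );

# State carries over between calls, so the split tag is still removed.
my $clean_text = join '', map { $hs->parse($_) } @chunks;

# Reset before parsing an unrelated document.
$hs->eof;

# Alternatively, let the parser reset itself after every call to parse().
my $one_shot = HTML::Strip->new( auto_reset => 1 );
my @snippets = map { $one_shot->parse($_) } ( '<i>first</i>', '<i>second</i>' );

print "[$clean_text]\n";
print "[$_]\n" for @snippets;
```

       With "auto_reset" enabled, each string is treated as a complete
       document, so it suits lists of unrelated snippets rather than one
       document read in pieces.
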
   METHODS
       new()
           Constructor.  Can optionally take a hash of settings (with keys
           corresponding to the "set_" methods below).

           For example, the following is a valid constructor:

            my $hs = HTML::Strip->new(
                                      striptags   => [ 'script', 'iframe' ],
                                      emit_spaces => 0
                                     );

       parse()
           Takes a string as an argument, returns it stripped of HTML.

       eof()
           Resets the current state information, ready to parse a new block of
           HTML.

       clear_striptags()
           Clears the current set of strip tags.

       add_striptag()
           Adds the string passed as an argument to the current set of strip
           tags.

       set_striptags()
           Takes a reference to an array of strings, which replace the current
           set of strip tags.

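           The three strip tag methods above might be combined as in the
           following sketch.  The tag choices and sample markup are
           assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Replace the default set (title, script, style, applet) with a custom one.
$hs->set_striptags( [ 'script', 'iframe' ] );

# Add one more tag whose contents should be dropped.
$hs->add_striptag('object');

# Hypothetical input: "title" is no longer a strip tag, so its text is kept.
my $clean_text = $hs->parse(
    '<title>Kept title</title><object>dropped</object><p>body</p>'
);
$hs->eof;

# Start from an empty set: only the markup itself is removed.
$hs->clear_striptags;
my $all_text = $hs->parse('<style>p { color: red }</style>');
$hs->eof;

print "[$clean_text]\n[$all_text]\n";
```
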
       set_emit_spaces()
           Takes a boolean value.  If set to false, HTML::Strip will not
           attempt any conversion of tags into spaces.  Set to true by
           default.

       set_decode_entities()
           Takes a boolean value.  If set to false, HTML::Strip will not
           decode HTML entities.  Set to true by default.

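           As a small sketch of the entity setting, the example below
           switches decoding off in the constructor and then back on.  The
           sample text is invented, and the constructor key is assumed to
           follow the "set_" naming described under new().

```perl
use strict;
use warnings;
use HTML::Strip;

# Leave entities untouched by switching decoding off in the constructor.
my $hs = HTML::Strip->new( decode_entities => 0 );
my $kept = $hs->parse('<p>fish &amp; chips</p>');
$hs->eof;

# The same setting can be changed later and read back with the accessor.
$hs->set_decode_entities(1);
print "decoding is ", ( $hs->decode_entities ? "on" : "off" ), "\n";
print "[$kept]\n";
```
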
       filter_entities()
           If HTML::Entities is available, this method behaves just like
           invoking HTML::Entities::decode_entities, except that it respects
           the current setting of 'decode_entities'.

       set_filter()
           Sets a filter to be applied after tags have been stripped.  It
           accepts the name of a method (like 'filter_entities') or a code ref.
           By default, its value is 'filter_entities' if HTML::Entities is
           available or "undef" otherwise.

       set_auto_reset()
           Takes a boolean value.  If set to true, "parse" resets after each
           call (equivalent to calling "eof").  Otherwise, the parser
           remembers its state from one call to "parse" to another, until you
           call "eof" explicitly.  Set to false by default.

       set_debug()
           Outputs extensive debugging information on internal state during
           the parse.  Not intended to be used by anyone except the module
           maintainer.

       decode_entities()
       filter()
       auto_reset()
       debug()
           Read-only accessors for their respective settings.

   LIMITATIONS
       Whitespace
           Despite only outputting one space character per group of tags, and
           avoiding doing so when tags are bordered by spaces or the start or
           end of strings, HTML::Strip can often output more whitespace than
           desired, such as with the following HTML:

            <h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>

           Which gives the following output:

           " HTML::Strip    fast, and brutal   "

           Thus, you may want to post-filter the output of HTML::Strip to
           remove excess whitespace (for example, using "tr/ / /s;").  (This
           has been improved since previous releases, but is still an issue.)

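           The post-filter suggested above can be sketched in plain Perl.
           The input is the output string quoted in this section; the extra
           trimming step and the variable names are additions for
           illustration.

```perl
use strict;
use warnings;

# Output string quoted in the Whitespace example.
my $clean_text = ' HTML::Strip    fast, and brutal   ';

# Squeeze each run of spaces down to a single space.
( my $squeezed = $clean_text ) =~ tr/ / /s;

# Optionally trim the leading and trailing space as well.
( my $trimmed = $squeezed ) =~ s/^ +| +$//g;

print "[$squeezed]\n";   # [ HTML::Strip fast, and brutal ]
print "[$trimmed]\n";    # [HTML::Strip fast, and brutal]
```

           Note that "tr/ / /s" only squeezes space characters; tabs and
           newlines in the text are left as they are.
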
       HTML Entities
           HTML::Strip will only attempt decoding of HTML entities if
           HTML::Entities is installed.

   EXPORT
       None by default.

AUTHOR

       Alex Bowley <kilinrax@cpan.org>

LICENSE

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.30.1                      2020-01-30                          Strip(3)
