Strip(3) User Contributed Perl Documentation Strip(3)

NAME
HTML::Strip - Perl extension for stripping HTML markup from text.

SYNOPSIS
use HTML::Strip;

my $hs = HTML::Strip->new();

my $clean_text = $hs->parse( $raw_html );
$hs->eof;

DESCRIPTION
This module simply strips HTML-like markup from text rapidly and
brutally. It could easily be used to strip XML or SGML markup instead,
but as removing HTML is a much more common problem, this module lives
in the HTML:: namespace.

It is written in XS, and is thus about five times faster than using
regular expressions for the same task.

It does not do any syntax checking (if you want that, use
HTML::Parser); instead, it merely applies the following rules:

1. Anything that looks like a tag, or a group of tags, will be replaced
with a single space character. Tags are considered to be anything
that starts with a "<" and ends with a ">", with the caveat that a
">" character may appear in either of the following without ending
the tag:

Quote
Quotes are considered to start with either a "'" or a """
character, and end with a matching character not preceded by an
odd number of escaping backslashes (i.e. "\"" does not end the
quote but "\\\\"" does).

Comment
If the tag starts with an exclamation mark, it is assumed to be
a declaration or a comment. Within such tags, ">" characters
do not end the tag if they appear within pairs of double dashes
(e.g. "<!-- <a href="old.htm">old page</a> -->" would be
stripped completely). No parsing for quotes is performed
within comments, so for instance "<!-- comment with both '
quote types " -->" would be entirely stripped.

2. Anything that appears within what we term strip tags is stripped as
well. By default, these tags are "title", "script", "style" and
"applet".
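
The quote rule in item 1 can be illustrated with a small pure-Perl sketch. It follows the examples given in the text (one backslash escapes the quote, four do not) and is not the module's XS implementation; the helper names is_escaped and closing_quote are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): true if the character at
# offset $pos is preceded by an odd number of consecutive backslashes.
sub is_escaped {
    my ($str, $pos) = @_;
    my $count = 0;
    $count++
        while $pos - $count > 0
        && substr($str, $pos - $count - 1, 1) eq '\\';
    return $count % 2;
}

# Hypothetical helper: offset of the quote that closes the one opened at
# $start, or -1 if the quote is never closed.
sub closing_quote {
    my ($str, $start) = @_;
    my $quote = substr($str, $start, 1);
    for my $i ($start + 1 .. length($str) - 1) {
        next unless substr($str, $i, 1) eq $quote;
        return $i unless is_escaped($str, $i);
    }
    return -1;
}

my $bs = chr(92);    # a single backslash, spelled out to avoid escaping noise

my $one  = '"a' . $bs . '"b"';          # one backslash before the inner quote
my $four = '"a' . ($bs x 4) . '"b"';    # four backslashes before the inner quote

print closing_quote($one,  0), "\n";    # prints 5: the inner quote is escaped
print closing_quote($four, 0), "\n";    # prints 6: the inner quote ends the quote
```

In the first string the quote is closed by the final character; in the second it is closed right after the four backslashes, so the trailing characters would fall outside the quote.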

HTML::Strip maintains state between calls, so you can parse a document
in chunks should you wish. If one chunk ends half-way through a tag,
quote, comment, or similar construct, it will remember this, and expect
the next call to parse to start with the remains of said tag.

If this is not going to be the case, be sure to call $hs->eof() between
calls to $hs->parse(). Alternatively, you may set "auto_reset" to
true in the constructor or at any time afterwards with "set_auto_reset",
so that the parser always operates on a one-shot basis (resetting after
each parsed chunk).
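
A minimal sketch of the two modes described above, assuming HTML::Strip is installed and behaves as documented here. The input strings are hypothetical, and the exact whitespace in the results is not shown because it depends on the module version.

```perl
use strict;
use warnings;

# Skip gracefully when the module is missing so the sketch runs anywhere.
unless ( eval { require HTML::Strip; 1 } ) {
    print "HTML::Strip is not installed; nothing to demonstrate\n";
    exit 0;
}

# Hypothetical input: one document split in the middle of an attribute value.
my @chunks = (
    '<p>Hello <a href="page',
    '.html">world</a></p>',
);

# Stateful parsing: the second call resumes inside the unfinished tag.
my $hs   = HTML::Strip->new();
my $text = join '', map { $hs->parse($_) } @chunks;
$hs->eof;    # reset before feeding an unrelated document

print "stateful: [$text]\n";

# One-shot parsing: with auto_reset, every call starts from a clean state,
# so each string should be a complete fragment on its own.
my $one_shot = HTML::Strip->new( auto_reset => 1 );
my @clean    = map { $one_shot->parse($_) } '<b>first</b>', '<i>second</i>';

print "one-shot: [$_]\n" for @clean;
```

The stateful object suits streamed input such as a file read in blocks; the one-shot object suits a list of independent snippets, where a forgotten eof call would otherwise let one malformed snippet affect the next.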

METHODS
new()
Constructor. Can optionally take a hash of settings (with keys
corresponding to the "set_" methods below).

For example, the following is a valid constructor:

my $hs = HTML::Strip->new(
striptags => [ 'script', 'iframe' ],
emit_spaces => 0
);

parse()
Takes a string as an argument, returns it stripped of HTML.

eof()
Resets the current state information, ready to parse a new block of
HTML.

clear_striptags()
Clears the current set of strip tags.

add_striptag()
Adds the string passed as an argument to the current set of strip
tags.

set_striptags()
Takes a reference to an array of strings, which replace the current
set of strip tags.
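
The three strip-tag methods can be combined as in the following sketch, which assumes HTML::Strip is installed and behaves as documented above. The HTML snippets are hypothetical.

```perl
use strict;
use warnings;

# Skip gracefully when the module is missing so the sketch runs anywhere.
unless ( eval { require HTML::Strip; 1 } ) {
    print "HTML::Strip is not installed; nothing to demonstrate\n";
    exit 0;
}

my $hs = HTML::Strip->new();

# Replace the default set (title, script, style, applet) ...
$hs->set_striptags( [ 'script', 'style' ] );

# ... then extend it by one tag.
$hs->add_striptag('iframe');

# "title" is no longer a strip tag, so its content is kept;
# the content of "iframe" is dropped along with the markup.
my $html = '<title>Kept title</title><p>Kept text</p><iframe>dropped</iframe>';
my $out  = $hs->parse($html);
$hs->eof;
print "[$out]\n";

# With an empty set, only the markup itself is removed.
$hs->clear_striptags;
my $all = $hs->parse('<script>var x = 1;</script>');
$hs->eof;
print "[$all]\n";
```

Clearing the set is useful when the text inside script or style elements is itself of interest, for example when indexing page source.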

set_emit_spaces()
Takes a boolean value. If set to false, HTML::Strip will not
attempt any conversion of tags into spaces. Set to true by
default.

set_decode_entities()
Takes a boolean value. If set to false, HTML::Strip will not decode
HTML entities. Set to true by default.

filter_entities()
If HTML::Entities is available, this method behaves just like
invoking HTML::Entities::decode_entities, except that it respects
the current setting of 'decode_entities'.

set_filter()
Sets a filter to be applied after tags are stripped. It accepts
either the name of a method (like 'filter_entities') or a code ref.
By default, its value is 'filter_entities' if HTML::Entities is
available or "undef" otherwise.

set_auto_reset()
Takes a boolean value. If set to true, "parse" resets after each
call (equivalent to calling "eof"). Otherwise, the parser
remembers its state from one call to "parse" to another, until you
call "eof" explicitly. Set to false by default.

set_debug()
Outputs extensive debugging information on internal state during
the parse. Not intended to be used by anyone except the module
maintainer.

decode_entities()
filter()
auto_reset()
debug()
Read-only accessors for their respective settings.

LIMITATIONS
Whitespace
Despite only outputting one space character per group of tags, and
avoiding doing so when tags are bordered by spaces or the start or
end of strings, HTML::Strip can often output more whitespace than
desired, as with the following HTML:

<h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>

This gives the following output:

" HTML::Strip fast, and brutal "

Thus, you may want to post-filter the output of HTML::Strip to
remove excess whitespace (for example, using "tr/ / /s;"). (This
has been improved since previous releases, but is still an issue.)
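
The post-filtering suggested above can be sketched in plain Perl. The input string is a hypothetical stand-in for parser output containing runs of spaces; the trimming and the broader "\s+" substitution are additions for illustration, not something the module does.

```perl
use strict;
use warnings;

# Hypothetical parser output with runs of spaces.
my $text = "  HTML::Strip    fast, and brutal   ";

# Squeeze each run of spaces into a single space, as suggested above.
( my $squeezed = $text ) =~ tr/ / /s;

# Optionally trim the leading and trailing whitespace as well.
( my $trimmed = $squeezed ) =~ s/^\s+|\s+$//g;

# A broader variant that also collapses tabs and newlines.
( my $collapsed = "a \t\n b" ) =~ s/\s+/ /g;

print "[$squeezed]\n";     # prints [ HTML::Strip fast, and brutal ]
print "[$trimmed]\n";      # prints [HTML::Strip fast, and brutal]
print "[$collapsed]\n";    # prints [a b]
```

The tr form only touches literal spaces and is the cheapest option; the substitution is the better choice when the input may contain line breaks or tabs.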

HTML Entities
HTML::Strip will only attempt decoding of HTML entities if
HTML::Entities is installed.
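
Whether entity decoding is possible on a given system can be checked before relying on it. The following is a sketch of one common way to test for an optional module; it is not necessarily how HTML::Strip performs the check internally, and the sample string is hypothetical.

```perl
use strict;
use warnings;

# Try to load HTML::Entities without dying if it is absent.
my $have_entities = eval { require HTML::Entities; 1 } ? 1 : 0;

# Hypothetical sample text containing one entity.
my $sample  = 'Tom &amp; Jerry';
my $decoded = $sample;

# In void context, decode_entities modifies its argument in place.
HTML::Entities::decode_entities($decoded) if $have_entities;

print $have_entities
    ? "HTML::Entities found: [$decoded]\n"
    : "HTML::Entities missing, text left as is: [$decoded]\n";
```

When the module is missing, callers that need decoded text have to install HTML::Entities (part of the HTML-Parser distribution) or handle entities themselves.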

EXPORT
None by default.

AUTHOR
Alex Bowley <kilinrax@cpan.org>

SEE ALSO
perl, HTML::Parser, HTML::Entities

COPYRIGHT AND LICENSE
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.34.0 2022-01-21 Strip(3)