Strip(3)          User Contributed Perl Documentation          Strip(3)

NAME
HTML::Strip - Perl extension for stripping HTML markup from text.

SYNOPSIS
    use HTML::Strip;

    my $hs = HTML::Strip->new();

    my $clean_text = $hs->parse( $raw_html );
    $hs->eof;

DESCRIPTION
This module simply strips HTML-like markup from text, rapidly and
brutally. It could easily be used to strip XML or SGML markup instead,
but because removing HTML is a much more common problem, this module
lives in the HTML:: namespace.

It is written in XS, and is thus about five times quicker than using
regular expressions for the same task.

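The speed claim above can be checked on your own data. The following is a rough sketch using the core Benchmark module; the sample input and the naive regex are assumptions made for illustration, and the regex does not handle quotes or comments the way HTML::Strip does, so the comparison is only indicative.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use HTML::Strip;

# Hypothetical sample input; real documents will give different ratios.
my $raw_html =
    '<p>Some <b>bold</b> and <a href="x.htm">linked</a> text.</p>' x 200;

my $hs = HTML::Strip->new();

# Run each approach for at least one CPU second and print a comparison table.
my $rows = cmpthese( -1, {
    'HTML::Strip' => sub {
        my $text = $hs->parse($raw_html);
        $hs->eof;    # reset state so every iteration starts clean
    },
    'regex' => sub {
        # Naive regex (an assumption): replace each tag with a space.
        ( my $text = $raw_html ) =~ s/<[^>]*>/ /g;
    },
} );
```

The ratio printed depends on the input, the Perl build and the machine, so treat the result as a local measurement rather than a confirmation of the figure quoted above.
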
It does not do any syntax checking. If you want that, use
HTML::Parser. Instead, it merely applies the following rules:

1. Anything that looks like a tag, or a group of tags, is replaced
   with a single space character. A tag is anything that starts with
   "<" and ends with ">", with the caveat that a ">" character may
   appear in either of the following without ending the tag:

   Quote
      Quotes start with either a single quote (') or a double quote
      (") and end with a matching character that is not preceded by
      an odd number of escaping backslashes (i.e. \" does not end
      the quote but \\\\" does).

   Comment
      If the tag starts with an exclamation mark, it is assumed to be
      a declaration or a comment. Within such tags, ">" characters
      do not end the tag if they appear within pairs of double dashes
      (e.g. "<!-- <a href="old.htm">old page</a> -->" would be
      stripped completely). No parsing for quotes is performed
      within comments, so for instance "<!-- comment with both '
      quote types " -->" would be entirely stripped.

2. Anything that appears between a pair of tags that we term strip
   tags is removed. By default, these tags are "title", "script",
   "style" and "applet".

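The two rules above can be tried out with a short script. This is a minimal sketch; the input strings are made up for illustration, and the exact spacing of the output is not shown because it depends on the whitespace handling described under LIMITATIONS.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Rule 1, quotes: the ">" inside the quoted attribute value
# should not end the tag.
my $quoted = $hs->parse('<a title="1 > 0" href="x.htm">link text</a>');
$hs->eof;

# Rule 1, comments: the ">" characters inside the double dashes
# should not end the comment.
my $comment =
    $hs->parse('before<!-- <a href="old.htm">old page</a> -->after');
$hs->eof;

# Rule 2: everything between default strip tags such as <script>
# should be removed.
my $script = $hs->parse('<p>kept</p><script>var removed = 1;</script>');
$hs->eof;

print "[$_]\n" for $quoted, $comment, $script;
```

Calling eof() after each parse() keeps the three inputs independent of one another.
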
HTML::Strip maintains state between calls, so you can parse a document
in chunks should you wish. If a call to parse() ends half-way through
a tag, quote or comment, the next call to parse() expects its input to
carry on from that point.

If this is not the behaviour you want, you can either call eof()
between calls to parse(), or set "auto_reset" to true (either in the
constructor or with "set_auto_reset") so that the parser resets
after each call.

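Chunked parsing as described above might look like the following sketch. The chunk boundaries are chosen by hand (an assumption for illustration) so that the first chunk ends inside a quoted attribute value.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# The first chunk ends half-way through the <a> tag, inside a quote;
# the second chunk carries on from that point.
my @chunks = ( 'Read the <a href="man', 'ual.htm">manual</a> first.' );

my $text = '';
$text .= $hs->parse($_) for @chunks;

# Reset the state before parsing an unrelated document.
$hs->eof;

print "[$text]\n";
```

Because the parser remembers that it is inside a tag when the first call returns, the remainder of the attribute in the second chunk should be treated as markup rather than text.
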
METHODS
new()
    Constructor. Can optionally take a hash of settings (with keys
    corresponding to the "set_" methods below).

    Example:

        my $hs = HTML::Strip->new(
            striptags   => [ 'script', 'iframe' ],
            emit_spaces => 0
        );

parse()
    Takes a string as an argument and returns it stripped of HTML.

eof()
    Resets the current state information, ready to parse a new block of
    HTML.

clear_striptags()
    Clears the current set of strip tags.

add_striptag()
    Adds the string passed as an argument to the current set of strip
    tags.

set_striptags()
    Takes a reference to an array of strings, which replace the current
    set of strip tags.

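The strip-tag methods above can be combined as in this sketch. The tag choices and the sample HTML are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Replace the default set (title, script, style, applet) with two tags.
$hs->set_striptags( [ 'script', 'style' ] );

# Add one more tag to the current set.
$hs->add_striptag('iframe');

my $html =
    '<title>Page title</title><iframe>frame fallback</iframe><p>Body</p>';
my $text = $hs->parse($html);
$hs->eof;

# "title" is no longer a strip tag, so its content should be kept,
# while everything inside <iframe> should be removed.
print "[$text]\n";
```
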
set_emit_spaces()
    Takes a boolean value. If set to false, HTML::Strip will not
    attempt any conversion of tags into spaces. Set to true by
    default.

set_emit_newlines()
    Takes a boolean value. If set to true, HTML::Strip will output
    newlines after "<br>" and "<p>" tags. Set to false by default.

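To compare the two spacing options above, the same input can be run through three differently configured objects. This is a sketch; the input string is an assumption, and the settings are passed through the constructor using the key names that correspond to the "set_" methods.

```perl
use strict;
use warnings;
use HTML::Strip;

my $html = 'first line<br>second line';

# Default settings: the <br> tag is converted into a space.
my $default = HTML::Strip->new()->parse($html);

# emit_spaces off: the tag is removed without a replacement space.
my $no_space = HTML::Strip->new( emit_spaces => 0 )->parse($html);

# emit_newlines on: a newline is output after the <br> tag.
my $newlines = HTML::Strip->new( emit_newlines => 1 )->parse($html);

print "[$_]\n" for $default, $no_space, $newlines;
```
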
set_decode_entities()
    Takes a boolean value. If set to false, HTML::Strip will not
    decode HTML entities. Set to true by default.

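The effect of the entity setting above can be seen by parsing the same string with decoding switched off and with the defaults. This is a sketch with a made-up input; with the defaults, the entity is only decoded when HTML::Entities is installed.

```perl
use strict;
use warnings;
use HTML::Strip;

my $html = '<p>Fish &amp; chips</p>';

# Decoding switched off: the entity should be left as written.
my $raw = HTML::Strip->new( decode_entities => 0 )->parse($html);

# Defaults: the entity is decoded if HTML::Entities is available.
my $decoded = HTML::Strip->new()->parse($html);

print "[$raw]\n[$decoded]\n";
```
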
filter_entities()
    If HTML::Entities is available, this method behaves just like
    invoking HTML::Entities::decode_entities, except that it respects
    the current setting of 'decode_entities'.

set_filter()
    Sets a filter to be applied after tags have been stripped. It
    accepts either the name of a method (like 'filter_entities') or a
    code ref. By default, its value is 'filter_entities' if
    HTML::Entities is available, or "undef" otherwise.

set_auto_reset()
    Takes a boolean value. If set to true, "parse" resets after each
    call (equivalent to calling "eof"). Otherwise, the parser
    remembers its state from one call to "parse" to another, until you
    call "eof" explicitly. Set to false by default.

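When every call to parse() receives an unrelated fragment, "auto_reset" saves the explicit eof() calls. The following sketch uses made-up fragments, the first of which deliberately ends inside a quoted attribute.

```perl
use strict;
use warnings;
use HTML::Strip;

# Two unrelated fragments; the first ends half-way through a tag.
my @fragments = ( 'unfinished <a href="x', 'separate <b>fragment</b>' );

my $hs = HTML::Strip->new( auto_reset => 1 );

# With auto_reset on, each fragment is parsed from a clean state,
# so the second one is not treated as the rest of the open tag.
my @clean = map { $hs->parse($_) } @fragments;

print "[$_]\n" for @clean;
```

With "auto_reset" left at its default of false, the second fragment would be read as a continuation of the unfinished attribute value.
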
set_debug()
    Outputs extensive debugging information on internal state during
    the parse. Not intended to be used by anyone except the module
    maintainer.

decode_entities()
filter()
auto_reset()
debug()
    Read-only accessors for their respective settings.

LIMITATIONS
Whitespace
    Despite only outputting one space character per group of tags, and
    avoiding doing so when tags are bordered by spaces or the start or
    end of strings, HTML::Strip can often output more whitespace than
    desired, such as with the following HTML:

        <h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>

    Which gives the following output:

        " HTML::Strip fast, and brutal "

    Thus, you may want to post-filter the output of HTML::Strip to
    remove excess whitespace (for example, using "tr/ / /s;"). (This
    has been improved since previous releases, but is still an issue.)

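The post-filter suggested above can be wrapped in a small helper. This is a sketch: squeeze_spaces is a hypothetical name, the sample string is made up, and the trimming step is an addition beyond the "tr/ / /s" suggested in the text.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze runs of spaces, then trim both ends.
sub squeeze_spaces {
    my ($text) = @_;
    $text =~ tr/ / /s;        # squeeze runs of spaces, as suggested above
    $text =~ s/^ +| +$//g;    # extra step: drop leading and trailing spaces
    return $text;
}

# Made-up sample resembling stripped output with excess spaces.
my $stripped = '  HTML::Strip    fast, and brutal   ';
my $tidy     = squeeze_spaces($stripped);

print "[$tidy]\n";    # prints "[HTML::Strip fast, and brutal]"
```

The helper only touches space characters; tabs and newlines in the stripped text are left alone.
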
HTML Entities
    HTML::Strip will only attempt decoding of HTML entities if
    HTML::Entities is installed.

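Whether entity decoding will be available on a given system can be checked before relying on it. This sketch uses only core Perl; the variable name is an assumption.

```perl
use strict;
use warnings;

# Try to load HTML::Entities without dying if it is missing.
my $have_entities = eval { require HTML::Entities; 1 } ? 1 : 0;

print $have_entities
    ? "HTML::Entities found: entities can be decoded\n"
    : "HTML::Entities missing: entities will be left as-is\n";
```
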
EXPORT
None by default.

AUTHOR
Alex Bowley <kilinrax@cpan.org>

SEE ALSO
perl, HTML::Parser, HTML::Entities

LICENSE
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.38.0                    2023-07-20                    Strip(3)