HTML::Strip(3pm)

Strip(3)              User Contributed Perl Documentation             Strip(3)

NAME

       HTML::Strip - Perl extension for stripping HTML markup from text.

SYNOPSIS

         use HTML::Strip;

         my $hs = HTML::Strip->new();

         my $clean_text = $hs->parse( $raw_html );
         $hs->eof;

DESCRIPTION

       This module strips HTML-like markup from text rapidly and brutally.
       It could just as easily be used to strip XML or SGML markup, but as
       removing HTML is a much more common problem, this module lives in the
       HTML:: namespace.

       It is written in XS, and is thus about five times quicker than using
       regular expressions for the same task.

       It does not do any syntax checking.  If you want that, use
       HTML::Parser.  Instead it merely applies the following rules:

       1.  Anything that looks like a tag, or group of tags, will be replaced
           with a single space character.  Tags are considered to be anything
           that starts with a "<" and ends with a ">", with the caveat that a
           ">" character may appear in either of the following without ending
           the tag:

           Quote
               Quotes are considered to start with either a "'" or a '"'
               character, and end with a matching character not preceded by an
               odd number of escaping backslashes (i.e. "\"" does not end the
               quote but "\\\\"" does).

           Comment
               If the tag starts with an exclamation mark, it is assumed to be
               a declaration or a comment.  Within such tags, ">" characters
               do not end the tag if they appear within pairs of double dashes
               (e.g. "<!-- <a href="old.htm">old page</a> -->" would be
               stripped completely).  No parsing for quotes is performed
               within comments, so for instance "<!-- comment with both '
               quote types " -->" would be entirely stripped.

       2.  Anything that appears between a pair of what we term strip tags is
           removed.  By default, these tags are "title", "script", "style" and
           "applet".
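
       As a minimal sketch of the two rules above, the following strips a
       fragment containing a quoted ">", a comment and a "script" element.
       The sample markup and variable names are illustrative only, and the
       exact spacing of the result is deliberately not shown:

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Rule 1: the ">" inside the quoted attribute does not end the tag, and
# the comment is removed together with the markup it contains.
my $html = q{<p title="a > b">Hello</p><!-- <a href="old.htm">old page</a> -->world};
my $text = $hs->parse($html);
$hs->eof;    # reset state before parsing an unrelated fragment

# Rule 2: text between strip tags such as <script> is dropped entirely.
my $page = q{<script>var hidden = 1;</script><p>Visible text</p>};
my $body = $hs->parse($page);
$hs->eof;

print "[$text]\n[$body]\n";
```

       Here "Hello", "world" and "Visible text" should survive, while "old
       page" and the script body should not.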

       HTML::Strip maintains state between calls, so you can parse a document
       in chunks should you wish.  If a call to parse() ends half-way through
       a tag, quote or comment, the next call to parse() expects its input to
       carry on from that point.

       If this is not the behaviour you want, you can either call eof()
       between calls to parse(), or set "auto_reset" to true (either on the
       constructor or with "set_auto_reset") so that the parser will reset
       after each call.
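
       The two modes can be sketched as follows.  The chunk boundaries and
       variable names are made up for illustration; the first chunk is cut
       in the middle of a tag on purpose:

```perl
use strict;
use warnings;
use HTML::Strip;

# Chunked input: state carries over, so the second chunk is read as the
# continuation of the tag left open by the first one.
my $hs  = HTML::Strip->new();
my $out = '';
$out .= $hs->parse($_) for ( 'Some <b cla', 'ss="x">bold</b> text' );
$hs->eof;    # explicit reset before any unrelated document

# Independent fragments: auto_reset makes every parse() start afresh.
my $auto  = HTML::Strip->new( auto_reset => 1 );
my @clean = map { $auto->parse($_) } ( '<p>first</p>', '<p>second</p>' );

print "[$out]\n";
print "[$_]\n" for @clean;
```

       Without auto_reset, forgetting the eof() call between unrelated
       documents would let a tag left open in one document swallow the start
       of the next.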

   METHODS
       new()
           Constructor.  Can optionally take a hash of settings (with keys
           corresponding to the "set_" methods below).

           Example:

            my $hs = HTML::Strip->new(
                striptags   => [ 'script', 'iframe' ],
                emit_spaces => 0
            );

       parse()
           Takes a string as an argument, returns it stripped of HTML.

       eof()
           Resets the current state information, ready to parse a new block of
           HTML.

       clear_striptags()
           Clears the current set of strip tags.

       add_striptag()
           Adds the string passed as an argument to the current set of strip
           tags.

       set_striptags()
           Takes a reference to an array of strings, which replace the current
           set of strip tags.
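
           Example (a sketch of the three strip tag methods used together;
           the tag names and sample markup are illustrative choices, not
           defaults of the module):

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Replace the default set (title, script, style, applet) outright.
$hs->set_striptags( [ 'script', 'iframe' ] );

# Extend the current set by one more tag name.
$hs->add_striptag('object');

my $text = $hs->parse('<title>Kept</title><object>dropped</object><p>body</p>');
$hs->eof;

# Start from an empty set so that no element content is discarded.
$hs->clear_striptags();
my $all = $hs->parse('<script>kept as text</script>');
$hs->eof;

print "[$text]\n[$all]\n";
```

           Since "title" is no longer a strip tag after set_striptags(), its
           content is expected to appear in the output.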

       set_emit_spaces()
           Takes a boolean value.  If set to false, HTML::Strip will not
           attempt any conversion of tags into spaces.  Set to true by
           default.

       set_emit_newlines()
           Takes a boolean value.  If set to true, HTML::Strip will output
           newlines after "<br>" and "<p>" tags.  Set to false by default.

       set_decode_entities()
           Takes a boolean value.  If set to false, HTML::Strip will not
           decode HTML entities.  Set to true by default.

       filter_entities()
           If HTML::Entities is available, this method behaves just like
           invoking HTML::Entities::decode_entities, except that it respects
           the current setting of 'decode_entities'.

       set_filter()
           Sets a filter to be applied after tags have been stripped.  It
           accepts either the name of a method (like 'filter_entities') or a
           code ref.  By default, its value is 'filter_entities' if
           HTML::Entities is available or "undef" otherwise.
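
           Example (a sketch of a code ref filter that tidies whitespace.
           How the code ref is invoked is an assumption here: the sketch only
           relies on the stripped text being the last argument, which holds
           whether or not the HTML::Strip object is passed first):

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Assumption: the stripped text arrives as the last argument.
$hs->set_filter( sub {
    my $text = $_[-1];
    $text =~ s/\s+/ /g;     # collapse runs of whitespace into one space
    $text =~ s/^ | $//g;    # drop a single leading and trailing space
    return $text;
} );

my $clean = $hs->parse('<h1> Title </h1> <p> body   text </p>');
$hs->eof;

print "[$clean]\n";
```

           Because this replaces the default 'filter_entities' value, a
           custom filter presumably has to decode entities itself if that is
           still wanted.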

       set_auto_reset()
           Takes a boolean value.  If set to true, "parse" resets after each
           call (equivalent to calling "eof").  Otherwise, the parser
           remembers its state from one call to "parse" to another, until you
           call "eof" explicitly.  Set to false by default.

       set_debug()
           Outputs extensive debugging information on internal state during
           the parse.  Not intended to be used by anyone except the module
           maintainer.

       decode_entities()
       filter()
       auto_reset()
       debug()
           Read-only accessors for their respective settings.

   LIMITATIONS
       Whitespace
           Despite only outputting one space character per group of tags, and
           avoiding doing so when tags are bordered by spaces or the start or
           end of strings, HTML::Strip can often output more whitespace than
           desired, such as with the following HTML:

            <h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>

           Which gives the following output:

           " HTML::Strip    fast, and brutal   "

           Thus, you may want to post-filter the output of HTML::Strip to
           remove excess whitespace (for example, using "tr/ / /s;").  (This
           has been improved since previous releases, but is still an issue.)
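
           The post-filtering step can be sketched in plain Perl.  The input
           below is the output string quoted above; the extra trimming step
           and the variable names are additions for illustration:

```perl
use strict;
use warnings;

# The output quoted above, as HTML::Strip would hand it back.
my $text = ' HTML::Strip    fast, and brutal   ';

# Squeeze each run of spaces down to a single space.
( my $squeezed = $text ) =~ tr/ / /s;

# Optionally trim what is left at either end.
( my $trimmed = $squeezed ) =~ s/^\s+|\s+$//g;

print "[$squeezed]\n";    # [ HTML::Strip fast, and brutal ]
print "[$trimmed]\n";     # [HTML::Strip fast, and brutal]
```

           Note that "tr/ / /s" only squeezes literal spaces; a pattern such
           as "s/\s+/ /g" would be needed to collapse tabs and newlines too.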

       HTML Entities
           HTML::Strip will only attempt decoding of HTML entities if
           HTML::Entities is installed.

   EXPORT
       None by default.

AUTHOR

       Alex Bowley <kilinrax@cpan.org>

LICENSE

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.


perl v5.38.0                      2023-07-20                          Strip(3)
