HTML::Strip(3pm)

Strip(3)              User Contributed Perl Documentation             Strip(3)

NAME

       HTML::Strip - Perl extension for stripping HTML markup from text.

SYNOPSIS

         use HTML::Strip;

         my $hs = HTML::Strip->new();

         my $clean_text = $hs->parse( $raw_html );
         $hs->eof;

DESCRIPTION

       This module strips HTML-like markup from text in a very quick and
       brutal manner. It could quite easily be used to strip XML or SGML
       from text as well, but removing HTML markup is a much more common
       problem, hence this module lives in the HTML:: namespace.

       It is written in XS, and thus about five times quicker than using
       regular expressions for the same task.

       It does not do any syntax checking (if you want that, use
       HTML::Parser); instead, it merely applies the following rules:

       1.  Anything that looks like a tag, or group of tags, will be
           replaced with a single space character. Tags are considered to be
           anything that starts with a "<" and ends with a ">", with the
           caveat that a ">" character may appear in either of the following
           without ending the tag:

           Quote
               Quotes are considered to start with either a single-quote (')
               or a double-quote (") character, and end with a matching
               character that is not preceded by an odd number of escaping
               backslashes (i.e. "\"" does not end the quote but "\\\\""
               does).

           Comment
               If the tag starts with an exclamation mark, it is assumed to be
               a declaration or a comment. Within such tags, ">" characters do
               not end the tag if they appear within pairs of double dashes
               (e.g. "<!-- <a href="old.htm">old page</a> -->" would be
               stripped completely).

       2.  Anything that appears within so-called strip tags is stripped as
           well. By default, these tags are "title", "script", "style" and
           "applet".
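
       The quote rule in item 1 can be illustrated in plain Perl. This is
       only a sketch of the rule as documented, not the module's XS
       implementation; closes_quote() is a hypothetical helper introduced
       here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): returns true if the quote
# character at offset $pos in $text would close a quote under the rule above,
# i.e. it is preceded by an even number (including zero) of backslashes.
sub closes_quote {
    my ( $text, $pos ) = @_;
    my $backslashes = 0;
    my $i           = $pos - 1;

    # Count the run of backslashes immediately before the quote character.
    while ( $i >= 0 && substr( $text, $i, 1 ) eq '\\' ) {
        $backslashes++;
        $i--;
    }
    return $backslashes % 2 == 0;
}

# One backslash before the quote: the quote is escaped.
print closes_quote( '\\"', 1 ) ? "ends\n" : "continues\n";          # continues

# Four backslashes before the quote: the quote is not escaped.
print closes_quote( '\\\\\\\\"', 4 ) ? "ends\n" : "continues\n";    # ends
```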

       HTML::Strip maintains state between calls, so you can parse a document
       in chunks should you wish. If one chunk ends half-way through a tag,
       quote, comment, or similar construct, it will remember this and expect
       the next call to parse() to start with the remainder of that construct.

       If this is not going to be the case, be sure to call $hs->eof() between
       calls to $hs->parse().
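
       A minimal sketch of chunked parsing, assuming HTML::Strip is
       installed. The chunk contents and variable names are made up for
       illustration, and the exact spacing of the output is not shown
       because it depends on the emit_spaces setting.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# One document split in the middle of a tag. There is no eof() between the
# chunks, so the unfinished <a ...> tag is carried over to the next call.
my @chunks = ( '<p>Hello <a hr', 'ef="page.html">world</a></p>' );
my $text   = join '', map { $hs->parse($_) } @chunks;

# The document is complete: reset the state before parsing anything else.
$hs->eof;

# An unrelated document can now be parsed from a clean state.
my $other = $hs->parse('<b>Second document</b>');
$hs->eof;

print "$text\n$other\n";
```

       Calling eof() only at document boundaries is what lets the second
       chunk be read as the remainder of the tag opened in the first.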

   METHODS
       new()
           Constructor. Can optionally take a hash of settings (with keys
           corresponding to the "set_" methods below).

           For example, the following is a valid constructor:

            my $hs = HTML::Strip->new(
                striptags   => [ 'script', 'iframe' ],
                emit_spaces => 0,
            );

       parse()
           Takes a string as an argument and returns it stripped of HTML.

       eof()
           Resets the current state information, ready to parse a new block of
           HTML.

       clear_striptags()
           Clears the current set of strip tags.

       add_striptag()
           Adds the string passed as an argument to the current set of strip
           tags.

       set_striptags()
           Takes a reference to an array of strings, which replace the current
           set of strip tags.

       set_emit_spaces()
           Takes a boolean value. If set to false, HTML::Strip will not
           attempt any conversion of tags into spaces. Set to true by default.

       set_decode_entities()
           Takes a boolean value. If set to false, HTML::Strip will not decode
           HTML entities. Set to true by default.
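
       The setter methods above might be combined as follows. This is a
       sketch that assumes HTML::Strip is installed; the input string is
       made up for illustration and the output is printed rather than
       asserted.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Replace the default strip tags, then add one more to the set.
$hs->set_striptags( [ 'script', 'style' ] );
$hs->add_striptag('iframe');

# Do not turn tags into spaces, and leave entities such as &amp; undecoded.
$hs->set_emit_spaces(0);
$hs->set_decode_entities(0);

my $text = $hs->parse('<p>Tom &amp; Jerry</p><iframe>hidden</iframe>');
$hs->eof;

print "$text\n";
```

       The same settings could instead be passed to new() using the
       striptags and emit_spaces keys shown in the constructor example.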

   LIMITATIONS
       Whitespace
           Despite only outputting one space character per group of tags, and
           avoiding doing so when tags are bordered by spaces or the start or
           end of strings, HTML::Strip can often output more whitespace than
           desired, such as with the following HTML:

            <h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>

           Which gives the following output:

           " HTML::Strip    fast, and brutal   "

           Thus, you may want to post-filter the output of HTML::Strip to
           remove excess whitespace (for example, using "tr/ / /s;"). This
           has been improved since previous releases, but is still an issue.

       HTML Entities
           HTML::Strip will only attempt decoding of HTML entities if
           HTML::Entities is installed.
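
       Both limitations can be handled from the calling code. The sketch
       below applies the suggested "tr/ / /s" post-filter to the sample
       output quoted under Whitespace, then checks whether HTML::Entities
       can be loaded. The variable names and the extra trimming step are
       assumptions added for illustration.

```perl
use strict;
use warnings;

# The sample output quoted in the Whitespace item above.
my $stripped = ' HTML::Strip    fast, and brutal   ';

# Squeeze each run of spaces down to a single space.
( my $squeezed = $stripped ) =~ tr/ / /s;

# Optionally trim leading and trailing whitespace as well.
( my $trimmed = $squeezed ) =~ s/^\s+|\s+$//g;

print "[$squeezed]\n";    # [ HTML::Strip fast, and brutal ]
print "[$trimmed]\n";     # [HTML::Strip fast, and brutal]

# Entity decoding depends on HTML::Entities, so test whether it loads.
my $can_decode = eval { require HTML::Entities; 1 } ? 1 : 0;
print $can_decode
    ? "HTML::Entities is available\n"
    : "HTML::Entities is not installed\n";
```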

   EXPORT
       None by default.

AUTHOR

       Alex Bowley <kilinrax@cpan.org>

POD ERRORS

       Hey! The above document had some coding errors, which are explained
       below:

       Around line 161:
           '=item' outside of any '=over'

       Around line 204:
           You forgot a '=back' before '=head2'

       Around line 230:
           You forgot a '=back' before '=head2'



perl v5.12.0                      2006-02-10                          Strip(3)
