HTML::FormatExternal(3)  User Contributed Perl Documentation  HTML::FormatExternal(3)

NAME
HTML::FormatExternal - HTML to text formatting using external programs

DESCRIPTION
This is a collection of formatter modules which turn HTML into plain
text by dumping it through the respective external programs.

    HTML::FormatText::Elinks
    HTML::FormatText::Html2text
    HTML::FormatText::Links
    HTML::FormatText::Lynx
    HTML::FormatText::Netrik
    HTML::FormatText::Vilistextum
    HTML::FormatText::W3m
    HTML::FormatText::Zen

The module interfaces are compatible with "HTML::Formatter" modules
such as "HTML::FormatText", but the external programs do all the work.

Common formatting options are used where possible, such as "leftmargin"
and "rightmargin". So just by switching the class you can use a
different program (or the plain "HTML::FormatText") according to
personal preference, the strengths and weaknesses of each program, or
whatever you have installed.

There's nothing particularly difficult about piping through these
programs, but a unified interface hides details like how to set margins
and how to force input or output charsets.

FUNCTIONS
Each of the classes above provides the following functions. The "XXX"
in the class names here is a placeholder for any of "Elinks", "Lynx",
etc as above.

See examples/demo.pl in the HTML-FormatExternal sources for a complete
sample program.

Formatter Compatible Functions
"$text = HTML::FormatText::XXX->format_file ($filename, key=>value,...)"
"$text = HTML::FormatText::XXX->format_string ($html_string, key=>value,...)"
    Run the formatter program over a file or string with the given
    options and return the formatted result as a string. See "OPTIONS"
    below for possible key/value options. For example,

        $text = HTML::FormatText::Lynx->format_file ('/my/file.html');

        $text = HTML::FormatText::W3m->format_string
          ('<html><body> <p> Hello world! </p> </body></html>');

    "format_file()" ensures any $filename is interpreted as a filename
    (by escaping as necessary against however the programs interpret
    command line arguments).

"$formatter = HTML::FormatText::XXX->new (key=>value, ...)"
    Create a formatter object with the given options. In the current
    implementation an object doesn't do much more than remember the
    options for future use.

        $formatter = HTML::FormatText::Elinks->new(rightmargin => 60);

"$text = $formatter->format ($tree_or_string)"
    Run the $formatter program on a "HTML::TreeBuilder" tree or a
    string, using the options in $formatter, and return the result as a
    string.

    A TreeBuilder argument (ie. a "HTML::Element") is accepted for
    compatibility with "HTML::Formatter". The tree is simply turned
    into a string with "$tree->as_HTML" to pass to the program, so if
    you've got a string already then give that instead of a tree.

    "HTML::Element" itself has a "format()" method (see "format" in
    HTML::Element) which runs a given $formatter. A
    "HTML::FormatExternal" object can be used for $formatter.

        $text = $tree->format($formatter);

        # which dispatches to
        $text = $formatter->format($tree);

Extra Functions
The following are extra methods not available in the plain
"HTML::FormatText".

"HTML::FormatText::XXX->program_version ()"
"HTML::FormatText::XXX->program_full_version ()"
"$formatter->program_version ()"
"$formatter->program_full_version ()"
    Return the version number of the formatter program as reported by
    its "--version" or similar option. If the formatter program is not
    available then return "undef".

    "program_version()" is the bare version number, perhaps with "beta"
    or similar indication. "program_full_version()" is the entire
    version output, which may include build options, copyright notice,
    etc.

        $str = HTML::FormatText::Lynx->program_version();
        # eg. "2.8.7dev.10"

        $str = HTML::FormatText::W3m->program_full_version();
        # eg. "w3m version w3m/0.5.2, options lang=en,m17n,image,..."

    The version number of the respective Perl module itself is
    available in the usual way (see "VERSION" in UNIVERSAL).

        $modulever = HTML::FormatText::Netrik->VERSION;
        $modulever = $formatter->VERSION;
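
Because "program_version()" returns "undef" when the program is missing,
it can be used to pick whichever formatter happens to be installed. The
following is only a sketch: the helper first_available_formatter() and
the order of preference are assumptions made for illustration and are
not part of the distribution.

```perl
use strict;
use warnings;

# Candidate classes in order of preference.  The order is an arbitrary
# choice for this sketch.
my @candidates = qw(HTML::FormatText::W3m
                    HTML::FormatText::Lynx
                    HTML::FormatText::Elinks
                    HTML::FormatText::Links);

# Return the first class which loads and whose program reports a version,
# or undef if none is usable.  (Hypothetical helper, not a module function.)
sub first_available_formatter {
  foreach my $class (@_) {
    (my $file = "$class.pm") =~ s{::}{/}g;     # Foo::Bar -> Foo/Bar.pm
    next unless eval { require $file; 1 };     # module not installed
    return $class if defined($class->program_version);
  }
  return undef;
}

my $class = first_available_formatter(@candidates);
if (defined $class) {
  my $text = $class->format_string
    ('<html><body><p>Hello world!</p></body></html>',
     rightmargin => 60);
  print "Formatted by $class:\n$text";
} else {
  print "No formatter program found\n";
}
```

Since the classes share one interface, the rest of a program need not
know which of them was chosen.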

CHARSETS
File or byte string input is by default interpreted by the programs in
their usual ways. This should mean HTML Latin-1, but user
configurations might override that and some programs recognise a
"<meta>" charset declaration or a Unicode BOM. The "input_charset"
option below can force the input charset.

A Perl wide-character input string is encoded and passed to the program
in whatever way it best understands. Usually this is UTF-8 but in some
cases it is entitized instead. The "input_charset" option can force
the input charset to use if for some reason UTF-8 is not best.

The output string is either bytes or wide chars. By default output is
the same as input, so wide char string input gives wide output and byte
input string or file input gives byte output. The "output_wide" option
can force the output type (and is the way to get wide chars back from
"format_file()").

Byte output is whatever the program produces. Its default might be the
locale charset or other user configuration which suits direct display
to the user's terminal. The "output_charset" option can force a
particular charset, either for certainty or to be ready for further
processing.

Wide char output is done by choosing the best output charset the
program can do and decoding its output. Usually this means UTF-8 but
some of the programs may only support something less. The
"output_charset" option can force the charset used and decoded. If it's
something less than UTF-8 then some programs might for example give
ASCII art approximations of otherwise unrepresentable characters.

Byte input is usual for HTML downloaded from a HTTP server or taken
from a MIME email, where the headers give the charset to pass as
"input_charset". Byte output is good to go straight out to a tty or
back to more MIME etc. The input and output charsets could differ if a
server gives something other than what you want for final output.

Wide chars are most convenient for crunching text within Perl. The
default wide input giving wide output is designed to be transparent for
this.

For reference, if a "HTML::Element" tree contains wide char strings
then its usual "as_HTML()" method, which is used by "format()" above,
produces wide char HTML so the formatters here give wide char text.
Actually "as_HTML()" produces all ASCII because its default behaviour
is to entitize anything "unsafe", but it's still a wide char string so
the formatted output text is wide.
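
The byte and wide char cases above might be exercised as follows. This
is a sketch only. The choice of W3m, the sample HTML and the variable
names are assumptions for illustration, and the formatting calls are
skipped if the module or program is not installed.

```perl
use strict;
use warnings;
use Encode ();

# A wide char string.  U+263A cannot be held in a byte string.
my $html_wide = "<html><body><p>caf\x{E9} \x{263A}</p></body></html>";

# The same document as UTF-8 bytes, as it might arrive from an HTTP
# server whose headers declare charset=UTF-8.
my $html_bytes = Encode::encode('UTF-8', $html_wide);

if (eval { require HTML::FormatText::W3m; 1 }
    && defined(HTML::FormatText::W3m->program_version)) {

  # Wide input gives wide output by default.
  my $wide_text = HTML::FormatText::W3m->format_string ($html_wide);

  # Byte input gives byte output.  Both charsets are forced here so the
  # result does not depend on the locale or user configuration.
  my $byte_text = HTML::FormatText::W3m->format_string
    ($html_bytes, input_charset => 'UTF-8', output_charset => 'UTF-8');

  # Byte input, but wide char output requested.
  my $forced_wide = HTML::FormatText::W3m->format_string
    ($html_bytes, input_charset => 'UTF-8', output_wide => 1);

  # Print the byte form, which needs no output layer on STDOUT.
  print $byte_text;
}
```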

OPTIONS
The following options can be given to the constructor or to the
formatting methods. The defaults are whatever the respective programs
do. The programs generally read their config files when dumping, so the
defaults and formatting details may follow the user's personal
preferences. Usually this is a good thing.

"leftmargin => INTEGER"
"rightmargin => INTEGER"
    The column numbers for the left and right hand ends of the text.
    "leftmargin" 0 means no padding on the left. "rightmargin" is the
    text width, so for instance 60 would mean the longest line is 60
    characters (inclusive of any "leftmargin"). These options are
    compatible with "HTML::FormatText".

    "rightmargin" is not necessarily a hard limit. Some of the
    programs will exceed it in a HTML literal "<pre>", or a run of
    "&nbsp;" or similar.

"input_charset => STRING"
    Force the HTML input to be interpreted as bytes of the given
    charset, irrespective of locale, user configuration, "<meta>" in
    the HTML, etc.

"output_charset => STRING"
    Force the text output to be encoded as the given charset. The
    default varies among the programs, but usually defaults to the
    locale.

"output_wide => 0,1,"as_input""
    Select output string as wide characters rather than bytes. The
    default is "as_input" which means a wide char input string results
    in a wide char output string and a byte input or file input is byte
    output. See "CHARSETS" above for how wide characters work.

    Bytes or wide chars output can be forced by 0 or 1 respectively.
    For example to get wide char output when formatting a file,

        $wide_char_text = HTML::FormatText::W3m->format_file
          ('/my/file.html', output_wide => 1);

"base => STRING"
    Set the base URL for any relative links within the HTML (similar to
    "HTML::FormatText::WithLinks"). Usually this should be the
    location the HTML was downloaded from.

    If the document contains its own "<base>" setting then currently
    the document takes precedence. Only Lynx and Elinks display
    absolutized link targets and the option has no effect on the other
    programs.
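
As a sketch of how these options combine, the following gives them once
to the constructor and once directly to a formatting method, then
reports the longest output line. The helper longest_line(), the sample
HTML and the example.com URL are assumptions for illustration. Since
"rightmargin" is not a hard limit the reported width may exceed 60.

```perl
use strict;
use warnings;
use List::Util 'max';

# Length of the longest line in $text, or 0 if there are no lines.
# (Hypothetical helper for this sketch.)
sub longest_line {
  my ($text) = @_;
  return max(0, map { length($_) } split /\n/, $text);
}

my $html = '<html><body><p>' . ('word ' x 40)
         . '<a href="page.html">a link</a></p></body></html>';

if (eval { require HTML::FormatText::Lynx; 1 }
    && defined(HTML::FormatText::Lynx->program_version)) {

  # Options given to the constructor are remembered by the object.
  my $formatter = HTML::FormatText::Lynx->new
    (leftmargin  => 4,
     rightmargin => 60,
     base        => 'http://example.com/docs/');   # example URL only
  my $text = $formatter->format ($html);

  # The same options can instead go straight to a formatting method.
  my $again = HTML::FormatText::Lynx->format_string
    ($html,
     leftmargin  => 4,
     rightmargin => 60,
     base        => 'http://example.com/docs/');

  print $text;
  printf "longest line: %d columns\n", longest_line($text);
}
```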

TAINT MODE
The formatter modules can be used under "perl -T" taint mode. They run
external programs so it's necessary to untaint $ENV{PATH} in the usual
way per "Cleaning Up Your Path" in perlsec.

The formatted text strings returned are always tainted, on the basis
that they use or include data from outside the Perl program. The
"program_version()" and "program_full_version()" strings are tainted
too.
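
A minimal sketch of the usual preparation for a script run as "perl -T"
follows. The PATH directories, the helper untaint_text() and the set of
characters it accepts are assumptions for illustration. What counts as
acceptable output is a policy decision for the application.

```perl
use strict;
use warnings;

# A fixed, trusted PATH.  The directories are an assumption for a
# typical Unix system.
$ENV{PATH} = '/usr/bin:/bin';

# Remove other variables which affect how programs are run, per perlsec.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};

# Return an untainted copy of $text if it consists only of printable
# ASCII, tabs and newlines, otherwise undef.  (Hypothetical helper.)
sub untaint_text {
  my ($text) = @_;
  return undef unless defined $text;
  return ($text =~ /\A([\x20-\x7E\t\n]*)\z/ ? $1 : undef);
}

if (eval { require HTML::FormatText::Lynx; 1 }) {
  # The version string is tainted, so check it before trusting it.
  my $version = untaint_text (HTML::FormatText::Lynx->program_version);
  print defined $version
    ? "lynx version $version\n"
    : "lynx not available, or unexpected version string\n";
}
```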

BUGS
"leftmargin" is implemented by adding spaces to the program output.
For byte output this is ASCII spaces and that will be badly wrong
for unusual output like UTF-16 which is not a byte superset of ASCII.
For wide char output the margin is applied after decoding to wide chars
so is correct. It'd be better to ask the programs to do the margin but
their options for that are poor.
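
The UTF-16 problem can be seen with the core "Encode" module alone. The
following sketch does not call the formatters. It only imitates the
padding step on a made-up string, with a margin of 2 chosen for
illustration.

```perl
use strict;
use warnings;
use Encode ();

my $margin = 2;
my $wide   = "h\x{E9}llo";

# Byte output in UTF-16BE, standing in for what a program might produce.
my $utf16 = Encode::encode('UTF-16BE', $wide);

# Wrong: ASCII space bytes in front of UTF-16 bytes.  The two 0x20
# bytes read back as the single character U+2020 instead of two spaces.
my $bad         = (' ' x $margin) . $utf16;
my $bad_decoded = Encode::decode('UTF-16BE', $bad);

# Right: decode to wide chars first, then add the margin.
my $good = (' ' x $margin) . Encode::decode('UTF-16BE', $utf16);

print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $bad_decoded),
  "\n";
# prints U+2020 U+0068 U+00E9 U+006C U+006C U+006F
```

An odd margin would be worse still, since it shifts every following
byte pair out of alignment.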

Nothing is done with error or warning messages from the programs.
Generally they make a best effort on doubtful HTML, but fatal errors
like bad options or missing libraries ought to be somehow trapped.

OTHER POSSIBILITIES
"elinks" (from Aug 2008 onwards) and "netrik" can produce ANSI escapes
for colours, underline, etc, and "html2text" and "lynx" can produce tty
style backspace overstriking. This might be good for text destined for
a tty or further crunching. Perhaps an "ansi" or "tty" option could
enable this, where possible, but for now it's deliberately turned off
in those programs to keep the default as plain text.

SEE ALSO
HTML::FormatText::Elinks, HTML::FormatText::Html2text,
HTML::FormatText::Links, HTML::FormatText::Netrik,
HTML::FormatText::Lynx, HTML::FormatText::Vilistextum,
HTML::FormatText::W3m, HTML::FormatText::Zen

HTML::FormatText, HTML::FormatText::WithLinks,
HTML::FormatText::WithLinks::AndTables

HOME PAGE
<http://user42.tuxfamily.org/html-formatexternal/index.html>

LICENSE
Copyright 2008, 2009, 2010, 2011, 2012, 2013, 2015 Kevin Ryde

HTML-FormatExternal is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 3, or (at
your option) any later version.

HTML-FormatExternal is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License along
with HTML-FormatExternal. If not, see <http://www.gnu.org/licenses/>.

perl v5.36.0                      2022-07-22           HTML::FormatExternal(3)