HTML::FormatExternal(3)  User Contributed Perl Documentation  HTML::FormatExternal(3)

NAME

       HTML::FormatExternal - HTML to text formatting using external programs

DESCRIPTION

       This is a collection of formatter modules which turn HTML into plain
       text by dumping it through the respective external programs.

           HTML::FormatText::Elinks
           HTML::FormatText::Html2text
           HTML::FormatText::Links
           HTML::FormatText::Lynx
           HTML::FormatText::Netrik
           HTML::FormatText::Vilistextum
           HTML::FormatText::W3m
           HTML::FormatText::Zen

       The module interfaces are compatible with "HTML::Formatter" modules
       such as "HTML::FormatText", but the external programs do all the work.

       Common formatting options are used where possible, such as "leftmargin"
       and "rightmargin".  So just by switching the class you can use a
       different program (or the plain "HTML::FormatText") according to
       personal preference, or strengths and weaknesses, or what you've got.

       There's nothing particularly difficult about piping through these
       programs, but a unified interface hides details like how to set margins
       and how to force input or output charsets.

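       Because the classes share one interface, a program can choose
       whichever formatter happens to be installed.  The following is a
       minimal sketch, not part of the distribution: the helper name
       first_usable_formatter() and the candidate order are assumptions
       made for illustration.  It relies only on program_version()
       returning "undef" when the external program is missing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML-FormatExternal): return the first
# class whose module loads and whose external program reports a version,
# or an empty result if none is usable.
sub first_usable_formatter {
    my @candidates = @_;
    foreach my $class (@candidates) {
        (my $file = "$class.pm") =~ s{::}{/}g;
        next unless eval { require $file; 1 };        # module not installed
        my $version = eval { $class->program_version };
        return $class if defined $version;            # program is available
    }
    return;
}

# The candidate order here is an arbitrary preference.
my $class = first_usable_formatter(
    'HTML::FormatText::W3m',
    'HTML::FormatText::Lynx',
    'HTML::FormatText::Elinks',
);

if (defined $class) {
    print $class->format_string(
        '<html><body><p>Hello world!</p></body></html>');
} else {
    print "no external formatter available\n";
}
```

       Since only the class name varies, the same call would work with
       any of the classes listed above.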

FUNCTIONS

       Each of the classes above provides the following functions.  The "XXX"
       in the class names here is a placeholder for any of "Elinks", "Lynx",
       etc as above.

       See examples/demo.pl in the HTML-FormatExternal sources for a complete
       sample program.

   Formatter Compatible Functions
       "$text = HTML::FormatText::XXX->format_file ($filename,
       key=>value,...)"
       "$text = HTML::FormatText::XXX->format_string ($html_string,
       key=>value,...)"
           Run the formatter program over a file or string with the given
           options and return the formatted result as a string.  See "OPTIONS"
           below for possible key/value options.  For example,

               $text = HTML::FormatText::Lynx->format_file ('/my/file.html');

               $text = HTML::FormatText::W3m->format_string
                 ('<html><body> <p> Hello world! </p> </body></html>');

           format_file() ensures any $filename is interpreted as a filename
           (by escaping as necessary against however the programs interpret
           command line arguments).

       "$formatter = HTML::FormatText::XXX->new (key=>value, ...)"
           Create a formatter object with the given options.  In the current
           implementation an object doesn't do much more than remember the
           options for future use.

               $formatter = HTML::FormatText::Elinks->new(rightmargin => 60);

       "$text = $formatter->format ($tree_or_string)"
           Run the $formatter program on an "HTML::TreeBuilder" tree or a
           string, using the options in $formatter, and return the result as a
           string.

           A TreeBuilder argument (ie. an "HTML::Element") is accepted for
           compatibility with "HTML::Formatter".  The tree is simply turned
           into a string with "$tree->as_HTML" to pass to the program, so if
           you've got a string already then give that instead of a tree.

           "HTML::Element" itself has a format() method (see "format" in
           HTML::Element) which runs a given $formatter.  An
           "HTML::FormatExternal" object can be used for $formatter.

               $text = $tree->format($formatter);

               # which dispatches to
               $text = $formatter->format($tree);

   Extra Functions
       The following are extra methods not available in the plain
       "HTML::FormatText".

       "HTML::FormatText::XXX->program_version ()"
       "HTML::FormatText::XXX->program_full_version ()"
       "$formatter->program_version ()"
       "$formatter->program_full_version ()"
           Return the version number of the formatter program as reported by
           its "--version" or similar option.  If the formatter program is not
           available then return "undef".

           program_version() is the bare version number, perhaps with "beta"
           or similar indication.  program_full_version() is the entire
           version output, which may include build options, copyright notice,
           etc.

               $str = HTML::FormatText::Lynx->program_version();
               # eg. "2.8.7dev.10"

               $str = HTML::FormatText::W3m->program_full_version();
               # eg. "w3m version w3m/0.5.2, options lang=en,m17n,image,..."

           The version number of the respective Perl module itself is
           available in the usual way (see "VERSION" in UNIVERSAL).

               $modulever = HTML::FormatText::Netrik->VERSION;
               $modulever = $formatter->VERSION;

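       The two kinds of version number can be combined into a small
       diagnostic report.  This is only a sketch: describe_formatter() is
       a hypothetical helper, and the wording of its output is an
       assumption.  It uses nothing beyond "VERSION" and program_version()
       as described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML-FormatExternal): summarise the
# Perl module version and the external program version for one class.
sub describe_formatter {
    my ($class) = @_;
    (my $file = "$class.pm") =~ s{::}{/}g;
    return "$class: module not installed"
        unless eval { require $file; 1 };

    my $module_version  = $class->VERSION;
    # program_version() is documented to return undef when the program
    # is missing; the eval is extra caution, not a documented need.
    my $program_version = eval { $class->program_version };

    return sprintf '%s: module %s, program %s',
        $class,
        defined $module_version  ? $module_version  : 'unknown',
        defined $program_version ? $program_version : 'not available';
}

print describe_formatter("HTML::FormatText::$_"), "\n"
    for qw(Elinks Html2text Links Lynx Netrik Vilistextum W3m Zen);
```

       A report like this may help when deciding which class to use on a
       given system.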

CHARSETS

       File or byte string input is by default interpreted by the programs in
       their usual ways.  This should mean HTML Latin-1 but user
       configurations might override that and some programs recognise a
       "<meta>" charset declaration or a Unicode BOM.  The "input_charset"
       option below can force the input charset.

       A Perl wide-character input string is encoded and passed to the
       program in whatever way it best understands.  Usually this is UTF-8
       but in some cases it is entitized instead.  The "input_charset" option
       can force the input charset to use if for some reason UTF-8 is not
       best.

       The output string is either bytes or wide chars.  By default output is
       the same as input, so wide char string input gives wide output and byte
       input string or file input gives byte output.  The "output_wide" option
       can force the output type (and is the way to get wide chars back from
       format_file()).

       Byte output is whatever the program produces.  Its default might be the
       locale charset or other user configuration which suits direct display
       to the user's terminal.  The "output_charset" option can force the
       output to be certain or to be ready for further processing.

       Wide char output is done by choosing the best output charset the
       program can do and decoding its output.  Usually this means UTF-8 but
       some of the programs may only support something less.  The
       "output_charset" option can force the charset used and decoded.  If
       it's something less than UTF-8 then some programs might for example
       give ASCII art approximations of otherwise unrepresentable characters.

       Byte input is usual for HTML downloaded from an HTTP server or from a
       MIME email, where the headers give the "input_charset" which applies.
       Byte output is good to go straight out to a tty or back to more MIME
       etc.  The input and output charsets could differ if a server gives
       something other than what you want for final output.

       Wide chars are most convenient for crunching text within Perl.  The
       default wide input giving wide output is designed to be transparent for
       this.

       For reference, if an "HTML::Element" tree contains wide char strings
       then its usual as_HTML() method, which is used by format() above,
       produces wide char HTML so the formatters here give wide char text.
       Actually as_HTML() produces all ASCII because its default behaviour is
       to entitize anything "unsafe", but it's still a wide char string so the
       formatted output text is wide.

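       For byte input taken from an HTTP response or MIME part, the charset
       named in the headers can be turned into an "input_charset" option.
       The sketch below is illustrative only: charset_options() is a
       hypothetical helper, and its pattern handles only a plain
       "charset=" parameter, not every legal header form.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML-FormatExternal): turn a
# Content-Type header value into formatter options.  Returns an empty
# list when no charset parameter is present, which leaves the external
# program to apply its own default.
sub charset_options {
    my ($content_type) = @_;
    return unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return (input_charset => $1);
    }
    return;
}

my %options = charset_options('text/html; charset=ISO-8859-2');
print "input_charset is $options{input_charset}\n";
```

       The resulting list could then be passed along with the HTML bytes,
       for instance as "format_string ($html_bytes, charset_options($type),
       output_wide => 1)" to get wide chars back for further work in Perl.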

OPTIONS

       The following options can be given to the constructor or to the
       formatting methods.  The defaults are whatever the respective programs
       do.  The programs generally read their config files when dumping so the
       defaults and formatting details may follow the user's personal
       preferences.  Usually this is a good thing.

       "leftmargin => INTEGER"
       "rightmargin => INTEGER"
           The column numbers for the left and right hand ends of the text.
           "leftmargin" 0 means no padding on the left.  "rightmargin" is the
           text width, so for instance 60 would mean the longest line is 60
           characters (inclusive of any "leftmargin").  These options are
           compatible with "HTML::FormatText".

           "rightmargin" is not necessarily a hard limit.  Some of the
           programs will exceed it in an HTML literal "<pre>", or a run of
           "&nbsp;" or similar.

       "input_charset => STRING"
           Force the HTML input to be interpreted as bytes of the given
           charset, irrespective of locale, user configuration, "<meta>" in
           the HTML, etc.

       "output_charset => STRING"
           Force the text output to be encoded as the given charset.  The
           default varies among the programs, but is usually the locale
           charset.

       "output_wide => 0,1,"as_input""
           Select output string as wide characters rather than bytes.  The
           default is "as_input" which means a wide char input string results
           in a wide char output string and a byte input or file input is byte
           output.  See "CHARSETS" above for how wide characters work.

           Bytes or wide chars output can be forced by 0 or 1 respectively.
           For example to get wide char output when formatting a file,

               $wide_char_text = HTML::FormatText::W3m->format_file
                                  ('/my/file.html', output_wide => 1);

       "base => STRING"
           Set the base URL for any relative links within the HTML (similar to
           "HTML::FormatText::WithLinks").  Usually this should be the
           location the HTML was downloaded from.

           If the document contains its own "<base>" setting then currently
           the document takes precedence.  Only Lynx and Elinks display
           absolutized link targets and the option has no effect on the other
           programs.

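       Since "rightmargin" is not a hard limit, code which depends on a
       maximum width may want to check the result.  A minimal sketch
       follows, assuming wide char text so that length() counts
       characters; overlong_lines() is a hypothetical helper and the
       sample text merely stands in for formatter output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML-FormatExternal): return the lines
# of $text which are longer than $rightmargin characters.
sub overlong_lines {
    my ($text, $rightmargin) = @_;
    return grep { length($_) > $rightmargin } split /\n/, $text;
}

# Stand-in for formatted output: one 70 character line between short ones.
my $text = "short line\n" . ('x' x 70) . "\nanother short line\n";
my @long = overlong_lines($text, 60);
printf "%d line(s) exceed column 60\n", scalar @long;
```

       What to do with such lines (truncate, rewrap, or leave alone) is up
       to the application.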

TAINT MODE

       The formatter modules can be used under "perl -T" taint mode.  They run
       external programs so it's necessary to untaint $ENV{PATH} in the usual
       way per "Cleaning Up Your Path" in perlsec.

       The formatted text strings returned are always tainted, on the basis
       that they use or include data from outside the Perl program.  The
       program_version() and program_full_version() strings are tainted too.

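       The usual untainting from perlsec looks like the following.  The
       directory list is an assumption and should be adjusted to wherever
       the formatter programs are installed on the system in question.

```perl
use strict;
use warnings;

# Assign a fixed, trusted search path.  These directories are only an
# example; list wherever lynx, w3m, etc. actually live.
$ENV{PATH} = '/usr/local/bin:/usr/bin:/bin';

# perlsec also advises removing these before running external programs.
delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
```

       This would go before the first call to a formatting method or to
       program_version().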

BUGS

       "leftmargin" is implemented by adding spaces to the program output.
       For byte output this is ASCII spaces and that will be badly wrong
       for unusual output like UTF-16 which is not a byte superset of ASCII.
       For wide char output the margin is applied after decoding to wide chars
       so is correct.  It'd be better to ask the programs to do the margin but
       their options for that are poor.

       There's nothing done with errors or warning messages from the programs.
       Generally they make a best effort on doubtful HTML, but fatal errors
       like bad options or missing libraries ought to be somehow trapped.

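       The UTF-16 problem can be seen with the core "Encode" module alone.
       This sketch does not call the formatter modules; it only imitates a
       margin of two ASCII space bytes being put in front of UTF-16BE
       output, and compares that with padding done on wide chars.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $wide  = "caf\x{E9}";                 # wide char text
my $bytes = encode('UTF-16BE', $wide);   # the same text as UTF-16BE bytes

# Byte-level margin: two ASCII space bytes 0x20 0x20 form the single
# UTF-16BE code unit 0x2020, which is U+2020 DAGGER, not two spaces.
my $bad = decode('UTF-16BE', '  ' . $bytes);

# Character-level margin: pad the wide string first, then encode.
my $good = decode('UTF-16BE', encode('UTF-16BE', '  ' . $wide));

printf "byte margin starts with U+%04X\n", ord($bad);    # U+2020
printf "char margin starts with U+%04X\n", ord($good);   # U+0020
```

       An odd number of padding bytes would be worse still, since every
       following code unit would then be misaligned.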

OTHER POSSIBILITIES

       "elinks" (from Aug 2008 onwards) and "netrik" can produce ANSI escapes
       for colours, underline, etc, and "html2text" and "lynx" can produce tty
       style backspace overstriking.  This might be good for text destined for
       a tty or further crunching.  Perhaps an "ansi" or "tty" option could
       enable this, where possible, but for now it's deliberately turned off
       in those programs to keep the default as plain text.


SEE ALSO

       HTML::FormatText::Elinks, HTML::FormatText::Html2text,
       HTML::FormatText::Links, HTML::FormatText::Netrik,
       HTML::FormatText::Lynx, HTML::FormatText::Vilistextum,
       HTML::FormatText::W3m, HTML::FormatText::Zen

       HTML::FormatText, HTML::FormatText::WithLinks,
       HTML::FormatText::WithLinks::AndTables


HOME PAGE

       <http://user42.tuxfamily.org/html-formatexternal/index.html>


LICENSE

       Copyright 2008, 2009, 2010, 2011, 2012, 2013, 2015 Kevin Ryde

       HTML-FormatExternal is free software; you can redistribute it and/or
       modify it under the terms of the GNU General Public License as
       published by the Free Software Foundation; either version 3, or (at
       your option) any later version.

       HTML-FormatExternal is distributed in the hope that it will be useful,
       but WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
       General Public License for more details.

       You should have received a copy of the GNU General Public License along
       with HTML-FormatExternal.  If not, see <http://www.gnu.org/licenses/>.

perl v5.36.0                      2023-01-20           HTML::FormatExternal(3)