IO::HTML(3)           User Contributed Perl Documentation          IO::HTML(3)

NAME

       IO::HTML - Open an HTML file with automatic charset detection

VERSION

       This document describes version 1.004 of IO::HTML, released September
       26, 2020.

SYNOPSIS

         use IO::HTML;                 # exports html_file by default
         use HTML::TreeBuilder;

         my $tree = HTML::TreeBuilder->new_from_file(
                      html_file('foo.html')
                    );

         # Alternative interface:
         open(my $in, '<:raw', 'bar.html')
           or die "Cannot open bar.html: $!";
         my $encoding = IO::HTML::sniff_encoding($in, 'bar.html');

DESCRIPTION

       IO::HTML provides an easy way to open a file containing HTML while
       automatically determining its encoding.  It uses the HTML5 encoding
       sniffing algorithm specified in section 8.2.2.2 of the draft standard.

       The algorithm as implemented here is:

       1.  If the file begins with a byte order mark indicating UTF-16LE,
           UTF-16BE, or UTF-8, then that is the encoding.

       2.  If the first $bytes_to_check bytes of the file contain a "<meta>"
           tag that indicates the charset, and Encode recognizes the specified
           charset name, then that is the encoding.  (This portion of the
           algorithm is implemented by "find_charset_in".)

           The "<meta>" tag can be in one of two formats:

             <meta charset="...">
             <meta http-equiv="Content-Type" content="...charset=...">

           The search is case-insensitive, and the order of attributes within
           the tag is irrelevant.  Any additional attributes of the tag are
           ignored.  The first matching tag with a recognized encoding ends
           the search.

       3.  If the first $bytes_to_check bytes of the file are valid UTF-8
           (with at least one non-ASCII character), then the encoding is
           UTF-8.

       4.  If all else fails, use the default character encoding.  The HTML5
           standard suggests the default encoding should be locale dependent,
           but currently it is always "cp1252" unless you set
           $IO::HTML::default_encoding to a different value.  Note:
           "sniff_encoding" does not apply this step; only "html_file" and
           "html_file_and_encoding" do.
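
       Steps 1 and 3 above can be sketched with core modules only.  This is
       a simplified illustration, not IO::HTML's actual code:
       "guess_by_bom_or_utf8" is a hypothetical helper, it takes a byte
       string rather than a filehandle, and it makes no allowance for a
       multi-byte character cut off at the end of the buffer.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of IO::HTML): steps 1 and 3 only.
sub guess_by_bom_or_utf8 {
    my ($bytes) = @_;

    # Step 1: a byte order mark decides the encoding outright.
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;

    # Step 3: valid UTF-8 containing at least one non-ASCII byte.
    if ($bytes =~ /[^\x00-\x7F]/) {
        my $copy = $bytes;    # decode with a CHECK value may alter its input
        my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
        return 'UTF-8' if $ok;
    }

    return;    # steps 2 (<meta>) and 4 (default) are left out of this sketch
}
```

       The names returned by the sketch are plain labels; IO::HTML itself
       returns Perl's canonical encoding names.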

SUBROUTINES

   html_file
         $filehandle = html_file($filename, \%options);

       This function (exported by default) is the primary entry point.  It
       opens the file specified by $filename for reading, uses
       "sniff_encoding" to find a suitable encoding layer, and applies it.  It
       also applies the ":crlf" layer.  If the file begins with a BOM, the
       filehandle is positioned just after the BOM.

       The optional second argument is a hashref containing options.  The
       possible keys are described under "find_charset_in".

       If "sniff_encoding" is unable to determine the encoding, "html_file"
       falls back to $IO::HTML::default_encoding, which is set to "cp1252"
       (a.k.a. Windows-1252) by default.  According to the standard, the
       default should be locale dependent, but that is not currently
       implemented.

       It dies if the file cannot be opened, or if "sniff_encoding" cannot
       determine the encoding and $IO::HTML::default_encoding has been set to
       "undef".

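       A minimal usage sketch follows.  It writes a small UTF-8 test file
       first so that it can run on its own; the temporary file and the
       variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::HTML;                      # exports html_file by default

# Write a small UTF-8 test file as raw bytes (the file name is arbitrary).
my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $tmp;
print {$tmp} qq{<meta charset="utf-8"><p>caf\xC3\xA9</p>\n};
close $tmp or die "Cannot close $path: $!";

# html_file sniffs the encoding and returns a handle that decodes it.
my $in   = html_file($path);
my $html = do { local $/; <$in> };   # slurp the decoded characters
close $in;
```

       After the read, $html holds characters rather than bytes, so the two
       bytes of the accented letter arrive as a single character.
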
   html_file_and_encoding
         ($filehandle, $encoding, $bom)
           = html_file_and_encoding($filename, \%options);

       This function (exported only by request) is just like "html_file", but
       returns more information.  In addition to the filehandle, it returns
       the name of the encoding used, and a flag indicating whether a byte
       order mark was found (if $bom is true, the file began with a BOM).
       This may be useful if you want to write the file out again (especially
       in conjunction with the "html_outfile" function).

       The optional second argument is a hashref containing options.  The
       possible keys are described under "find_charset_in".

       It dies if the file cannot be opened, or if "sniff_encoding" cannot
       determine the encoding and $IO::HTML::default_encoding has been set to
       "undef".

       The result of calling "html_file_and_encoding" in scalar context is
       undefined (in the C sense: there is no guarantee what you will get).

   html_outfile
         $filehandle = html_outfile($filename, $encoding, $bom);

       This function (exported only by request) opens $filename for output
       using $encoding, and writes a BOM to it if $bom is true.  If $encoding
       is "undef", it defaults to $IO::HTML::default_encoding.  $encoding may
       be either an encoding name or an Encode::Encoding object.

       It dies if the file cannot be opened, or if both $encoding and
       $IO::HTML::default_encoding are "undef".

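       The read-then-rewrite use mentioned under "html_file_and_encoding"
       might look like the following sketch.  The temporary directory and
       the file names are assumptions for the illustration; only the
       IO::HTML calls come from this document.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::HTML qw(html_file_and_encoding html_outfile);

my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.html", "$dir/out.html");

# Create a UTF-8 input file that starts with a byte order mark.
open(my $raw, '>:raw', $src) or die "Cannot write $src: $!";
print {$raw} "\xEF\xBB\xBF<p>caf\xC3\xA9</p>\n";
close $raw or die "Cannot close $src: $!";

# Read it, remembering the encoding and whether a BOM was present.
my ($in, $encoding, $bom) = html_file_and_encoding($src);
my $html = do { local $/; <$in> };
close $in;

# Write it back out with the same encoding and the same BOM setting.
my $out = html_outfile($dst, $encoding, $bom);
print {$out} $html;
close $out or die "Cannot close $dst: $!";
```

       Passing $encoding and $bom straight through keeps the output file in
       the same encoding as the input, BOM included.
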
   sniff_encoding
         ($encoding, $bom) = sniff_encoding($filehandle, $filename, \%options);

       This function (exported only by request) runs the HTML5 encoding
       sniffing algorithm on $filehandle (which must be seekable, and should
       have been opened in ":raw" mode).  $filename is used only for error
       messages (if there's a problem using the filehandle), and defaults to
       "file" if omitted.  The optional third argument is a hashref containing
       options.  The possible keys are described under "find_charset_in".

       It returns Perl's canonical name for the encoding, which is not
       necessarily the same as the MIME or IANA charset name.  It returns
       "undef" if the encoding cannot be determined.  $bom is true if the file
       began with a byte order mark.  In scalar context, it returns only
       $encoding.

       The filehandle's position is restored to its original position
       (normally the beginning of the file) unless $bom is true.  In that
       case, the position is immediately after the BOM.

       Tip: If you want to run "sniff_encoding" on a file you've already
       loaded into a string, open an in-memory file on the string, and pass
       that handle:

         ($encoding, $bom) = do {
           open(my $fh, '<', \$string) or die "Cannot open string: $!";
           sniff_encoding($fh);
         };

       (This only makes sense if $string contains bytes, not characters.)

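       Because "sniff_encoding" returns "undef" rather than a default, the
       caller has to supply any fallback.  The sketch below shows one way to
       do that with an in-memory file; the choice of "cp1252" as the
       fallback and the variable names are assumptions for the
       illustration.

```perl
use strict;
use warnings;
use IO::HTML qw(sniff_encoding);

# Bytes with no BOM, no <meta> tag, and no non-ASCII content.
my $bytes = "<p>plain ASCII, no BOM, no meta tag</p>\n";
open(my $fh, '<:raw', \$bytes) or die "Cannot open in-memory file: $!";

my ($encoding, $bom) = sniff_encoding($fh, 'in-memory string');

# sniff_encoding never applies $IO::HTML::default_encoding,
# so the caller chooses what to do when nothing was detected.
my $layer = defined $encoding ? $encoding : 'cp1252';
binmode($fh, ":encoding($layer)") or die "Cannot set layer: $!";
my $text = do { local $/; <$fh> };
close $fh;
```
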
   find_charset_in
         $encoding = find_charset_in($string_containing_HTML, \%options);

       This function (exported only by request) looks for charset information
       in a "<meta>" tag in a possibly-incomplete HTML document using the "two
       step" algorithm specified by HTML5.  It does not look for a BOM.  The
       "<meta>" tag must begin within the first $IO::HTML::bytes_to_check
       bytes of the string.

       It returns Perl's canonical name for the encoding, which is not
       necessarily the same as the MIME or IANA charset name.  It returns
       "undef" if no charset is specified or if the specified charset is not
       recognized by the Encode module.

       The optional second argument is a hashref containing options.  The
       following keys are recognized:

       "encoding"
           If true, return the Encode::Encoding object instead of its name.
           Defaults to false.

       "need_pragma"
           If true (the default), follow the HTML5 spec and examine the
           "content" attribute only of "<meta http-equiv="Content-Type"".  If
           set to 0, relax the HTML5 spec, and look for "charset=" in the
           "content" attribute of every meta tag.
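
       The effect of the two options can be seen in a short sketch.  The
       sample "<meta>" tag and the variable names are made up for the
       illustration.

```perl
use strict;
use warnings;
use IO::HTML qw(find_charset_in);

# A <meta> tag with charset= in "content" but no http-equiv attribute.
my $html = '<meta name="generator" content="text/html; charset=utf-8">';

my $strict  = find_charset_in($html);                        # follow the spec
my $relaxed = find_charset_in($html, { need_pragma => 0 });  # any meta tag
my $object  = find_charset_in($html, { need_pragma => 0, encoding => 1 });

printf "strict:  %s\n", defined $strict  ? $strict  : 'undef';
printf "relaxed: %s\n", defined $relaxed ? $relaxed : 'undef';
printf "object:  %s\n", ref $object;
```

       With the default options the tag is skipped because it lacks
       http-equiv="Content-Type"; with "need_pragma" set to 0 its "content"
       attribute is examined anyway.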

EXPORTS

       By default, only "html_file" is exported.  Other functions may be
       exported on request.

       For people who prefer not to export functions, all functions beginning
       with "html_" have an alias without that prefix (e.g. you can call
       "IO::HTML::file(...)" instead of "IO::HTML::html_file(...)").  These
       aliases are not exportable.

       The following export tags are available:

       ":all"
           All exportable functions.

       ":rw"
           "html_file", "html_file_and_encoding", "html_outfile".
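
       A sketch of the non-exporting style, using the unprefixed alias.  The
       temporary file exists only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::HTML ();                  # empty list: import nothing

my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$tmp} "<p>hello</p>\n";
close $tmp or die "Cannot close $path: $!";

# IO::HTML::file is the alias for IO::HTML::html_file.
my $in   = IO::HTML::file($path);
my $line = <$in>;
close $in;
```

       To import the three file functions instead, the ":rw" tag can be
       used: use IO::HTML ':rw';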

SEE ALSO

       The HTML5 specification, section 8.2.2.2 Determining the character
       encoding:
       <http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding>

DIAGNOSTICS

       "Could not read %s: %s"
           The specified file could not be read, for the reason given by $!.

       "Could not seek %s: %s"
           The specified file could not be rewound, for the reason given by
           $!.

       "Failed to open %s: %s"
           The specified file could not be opened for reading, for the reason
           given by $!.

       "No default encoding specified"
           The "sniff_encoding" algorithm didn't find an encoding to use, and
           you set $IO::HTML::default_encoding to "undef".

CONFIGURATION AND ENVIRONMENT

       There are two global variables that affect IO::HTML.  If you need to
       change them, you should do so using "local" if possible:

         my $file = do {
           # This file may define the charset later in the header
           local $IO::HTML::bytes_to_check = 4096;
           html_file(...);
         };

       $bytes_to_check
           This is the number of bytes that "sniff_encoding" will read from
           the stream.  It is also the number of bytes that "find_charset_in"
           will search for a "<meta>" tag containing charset information.  It
           must be a positive integer.

           The HTML 5 specification recommends using the default value of
           1024, but some pages do not follow the specification.

       $default_encoding
           This is the encoding that "html_file" and "html_file_and_encoding"
           will use if no encoding can be detected by "sniff_encoding".  The
           default value is "cp1252" (a.k.a. Windows-1252).

           Setting it to "undef" will cause the file subroutines to croak if
           "sniff_encoding" fails to determine the encoding.
           ("sniff_encoding" itself does not use $default_encoding).
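
       The croak described for $default_encoding can be trapped with "eval".
       The sketch below is one way to do it; the temporary ASCII-only file
       and the variable names are assumptions for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::HTML;

# Pure ASCII: no BOM, no <meta> tag, no non-ASCII bytes,
# so sniff_encoding has nothing to go on.
my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$tmp} "<p>plain ASCII</p>\n";
close $tmp or die "Cannot close $path: $!";

my $in = eval {
    local $IO::HTML::default_encoding;   # undef for the duration of the block
    html_file($path);
};
my $error = $@;
print "html_file refused to guess: $error" unless $in;
```

       Using "local" inside the "eval" block means the previous default is
       restored as soon as the block exits, whether or not the call died.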

DEPENDENCIES

       IO::HTML has no non-core dependencies for Perl 5.8.7+.  With earlier
       versions of Perl 5.8, you need to upgrade Encode to at least version
       2.10, and you may need to upgrade Exporter to at least version 5.57.

INCOMPATIBILITIES

       None reported.

BUGS AND LIMITATIONS

       No bugs have been reported.

AUTHOR

       Christopher J. Madsen  "<perl AT cjmweb.net>"

       Please report any bugs or feature requests to
       "<bug-IO-HTML AT rt.cpan.org>" or through the web interface at
       <http://rt.cpan.org/Public/Bug/Report.html?Queue=IO-HTML>.

       You can follow or contribute to IO-HTML's development at
       <https://github.com/madsen/io-html>.

COPYRIGHT AND LICENSE

       This software is copyright (c) 2020 by Christopher J. Madsen.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

DISCLAIMER OF WARRANTY

       BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
       FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT
       WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER
       PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND,
       EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
       WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE
       ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH
       YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL
       NECESSARY SERVICING, REPAIR, OR CORRECTION.

       IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
       WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
       REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENSE, BE LIABLE
       TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR
       CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
       SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
       RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
       FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
       SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
       DAMAGES.

perl v5.34.0                      2022-01-21                       IO::HTML(3)