HTML::Encoding(3pm)

HTML::Encoding(3)     User Contributed Perl Documentation    HTML::Encoding(3)

NAME

       HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents

SYNOPSIS

         use HTML::Encoding 'encoding_from_http_message';
         use LWP::UserAgent;
         use Encode;

         my $resp = LWP::UserAgent->new->get('http://www.example.org');
         # Detect the encoding from the HTTP headers and the document itself.
         my $enco = encoding_from_http_message($resp);
         # decode() returns a Perl character string, not UTF-8 octets.
         my $text = decode($enco => $resp->content);

WARNING

       The interface and implementation are guaranteed to change before this
       module reaches version 1.00! Please send feedback to the author of this
       module.

DESCRIPTION

       HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
       documents...

DEFAULT ENCODINGS

       Most routines need a list of suspected character encodings, which can
       be provided through the "encodings" option. This option always defaults
       to the $HTML::Encoding::DEFAULT_ENCODINGS array reference, which means
       the following encodings are considered by default:

         * ISO-8859-1
         * UTF-16LE
         * UTF-16BE
         * UTF-32LE
         * UTF-32BE
         * UTF-8

       If you change the values or pass custom values to the routines, note
       that Encode must support them in order for this module to work
       correctly.
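
       As a minimal sketch of that requirement, the following uses only the
       core Encode module to check a candidate list before handing it to the
       routines. The candidate list, including the deliberately bogus name,
       is an assumption made for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical candidate list; 'x-no-such-encoding' is a deliberately bogus name.
my @candidates = qw(UTF-8 UTF-16LE ISO-8859-1 x-no-such-encoding);

# Encode::find_encoding returns an encoding object, or undef for unknown names.
my @supported   = grep {  defined Encode::find_encoding($_) } @candidates;
my @unsupported = grep { !defined Encode::find_encoding($_) } @candidates;

print "supported:   @supported\n";
print "unsupported: @unsupported\n";
```

       Only the names in @supported would be safe to pass through the
       "encodings" option.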

ENCODING SOURCES

       "encoding_from_xml_document", "encoding_from_html_document", and
       "encoding_from_http_message" return the encoding source and the
       encoding name in list context. Possible encoding sources are:

         * protocol         (Content-Type: text/html;charset=encoding)
         * bom              (leading U+FEFF)
         * xml              (<?xml version='1.0' encoding='encoding'?>)
         * meta             (<meta http-equiv=...)
         * default          (default fallback value)
         * protocol_default (protocol default)

ROUTINES

       The following routines are exported on request. By default, nothing is
       exported.

       encoding_from_content_type($content_type)
         Takes a byte string and uses HTTP::Headers::Util to extract the
         charset parameter from the "Content-Type" header value. Returns its
         value, or "undef" (an empty list in list context) if there is no
         such value. Only the first component is examined (HTTP/1.1 allows
         only one component). Backslash escapes in strings are unescaped,
         leading and trailing quote marks and white-space characters are
         removed, and runs of white-space are collapsed to a single space.
         Empty charset values are ignored and no case folding is performed.

         Examples:

           +-----------------------------------------+-----------+
           | encoding_from_content_type(...)         | returns   |
           +-----------------------------------------+-----------+
           | "text/html"                             | undef     |
           | "text/html,text/plain;charset=utf-8"    | undef     |
           | "text/html;charset="                    | undef     |
           | "text/html;charset=\"\\u\\t\\f\\-\\8\"" | 'utf-8'   |
           | "text/html;charset=utf\\-8"             | 'utf\\-8' |
           | "text/html;charset='utf-8'"             | 'utf-8'   |
           | "text/html;charset=\" UTF-8 \""         | 'UTF-8'   |
           +-----------------------------------------+-----------+

         If you pass a string with the UTF-8 flag turned on, the string is
         converted to bytes before it is passed to HTTP::Headers::Util. The
         return value will thus never have the UTF-8 flag turned on (this
         might change in future versions).
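
         A hedged usage sketch: because the routine returns "undef" when no
         charset is present, callers usually want a fallback. The helper
         name charset_or and the fallback value are assumptions made for
         illustration. The module is loaded at run time so that the sketch
         still runs (and simply returns the fallback) where HTML::Encoding
         is not installed.

```perl
use strict;
use warnings;

# Load HTML::Encoding at run time so the sketch exits cleanly when it is absent.
my $have_module = eval { require HTML::Encoding; 1 };

# Hypothetical helper: charset of a Content-Type value, or a caller-supplied fallback.
sub charset_or {
    my ($content_type, $fallback) = @_;
    return $fallback unless $have_module;
    my $charset = HTML::Encoding::encoding_from_content_type($content_type);
    return defined $charset ? $charset : $fallback;
}

print charset_or('text/html;charset=utf-8', 'unknown'), "\n";
print charset_or('text/html',               'unknown'), "\n";
```

         Since no case folding is performed, a caller comparing charset
         names would probably want to lowercase the result first.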

       encoding_from_byte_order_mark($octets [, %options])
         Takes a sequence of octets and attempts to read a byte order mark at
         the beginning of the octet sequence. It goes through the list in
         $options{encodings}, or the list of default encodings if none are
         specified, and matches the beginning of the string against the byte
         order mark octet sequence of each encoding.

         The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could
         be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
         U+0000 character. It is also possible that $octets starts with
         something that looks like a byte order mark but actually is not.

         encoding_from_byte_order_mark sorts the list of possible encodings by
         the length of their BOM octet sequence. In scalar context it returns
         only the encoding with the longest match; in list context it returns
         all matching encodings ordered by the length of their BOM octet
         sequence.

         Examples (the first two rows show the scalar-context and the
         list-context result for the same input):

           +-------------------------+------------+-----------------------+
           | Input                   | Encodings  | Result                |
           +-------------------------+------------+-----------------------+
           | "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE)          |
           | "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE UTF-16LE) |
           | "\xEF\xBB\xBF"          | default    | qw(UTF-8)             |
           | "Hello World!"          | default    | undef                 |
           | "\xDD\x73\x66\x73"      | default    | undef                 |
           | "\xDD\x73\x66\x73"      | UTF-EBCDIC | qw(UTF-EBCDIC)        |
           | "\x2B\x2F\x76\x38\x2D"  | default    | undef                 |
           | "\x2B\x2F\x76\x38\x2D"  | UTF-7      | qw(UTF-7)             |
           +-------------------------+------------+-----------------------+

         Note, however, that in UTF-7 the U+FEFF can in theory combine with
         the characters that follow it, in which case detection fails. For
         example:

           +--------------------------------------+-----------+-----------+
           | Input                                | Encodings | Result    |
           +--------------------------------------+-----------+-----------+
           | "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | default   | undef     |
           | "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | UTF-7     | undef     |
           +--------------------------------------+-----------+-----------+

         This might change in future versions, although it is not very
         relevant for most applications, as there should never be a need to
         use UTF-7 in the encoding list for existing documents.

         If no BOM can be found, it returns "undef" in scalar context and an
         empty list in list context. This routine should not be used with
         strings that have the UTF-8 flag turned on.
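
         The longest-match ordering can be illustrated with the core Encode
         module alone. This is a sketch of the idea, not this module's
         implementation; the helper name bom_candidates is an assumption. It
         encodes U+FEFF in each candidate encoding and keeps the encodings
         whose result is a prefix of the input.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not the module's code: list the encodings whose
# encoded U+FEFF is a prefix of $octets, longest BOM first.
sub bom_candidates {
    my ($octets, @encodings) = @_;
    my %bom;
    for my $name (@encodings) {
        my $char = "\x{FEFF}";    # fresh copy, because encode() may modify its input
        my $seq  = eval { encode($name, $char, Encode::FB_CROAK()) };
        # Skip encodings that cannot represent U+FEFF (e.g. ISO-8859-1).
        next unless defined $seq && length $seq;
        $bom{$name} = $seq if index($octets, $seq) == 0;
    }
    # Longest BOM first; ties are broken by name to keep the order stable.
    return sort { length($bom{$b}) <=> length($bom{$a}) or $a cmp $b } keys %bom;
}

my @defaults = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);
my @found    = bom_candidates("\xFF\xFE\x00\x00", @defaults);
print "@found\n";
```

         For the ambiguous input above this prints "UTF-32LE UTF-16LE",
         which mirrors the list-context row of the first table.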

       encoding_from_xml_declaration($declaration)
         Attempts to extract the value of the encoding pseudo-attribute in an
         XML declaration or text declaration in the character string
         $declaration. If there does not appear to be such a value, it returns
         nothing. This would typically be used with the return values of
         xml_declaration_from_octets. Whitespace is normalized as in
         encoding_from_content_type.

         Examples:

           +-------------------------------------------+---------+
           | encoding_from_xml_declaration(...)        | Result  |
           +-------------------------------------------+---------+
           | "<?xml version='1.0' encoding='utf-8'?>"  | 'utf-8' |
           | "<?xml encoding='utf-8'?>"                | 'utf-8' |
           | "<?xml encoding=\"utf-8\"?>"              | 'utf-8' |
           | "<?xml foo='bar' encoding='utf-8'?>"      | 'utf-8' |
           | "<?xml encoding='a' encoding='b'?>"       | 'a'     |
           | "<?xml encoding=' a    b '?>"             | 'a b'   |
           | "<?xml-stylesheet encoding='utf-8'?>"     | undef   |
           | " <?xml encoding='utf-8'?>"               | undef   |
           | "<?xml encoding =\x{2028}'utf-8'?>"       | 'utf-8' |
           | "<?xml version='1.0' encoding=utf-8?>"    | undef   |
           | "<?xml x='encoding=\"a\"' encoding='b'?>" | 'a'     |
           +-------------------------------------------+---------+

         Note that encoding_from_xml_declaration() determines the encoding
         even if the XML declaration is not well-formed or violates other
         requirements of the relevant XML specification, as long as it can
         find an encoding pseudo-attribute in the provided string. This means
         XML processors must apply further checks to determine whether the
         entity is well-formed, etc.
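
         The lenient matching described above can be approximated with a
         single regular expression. The following is a sketch under that
         assumption, not the module's actual code; the helper name
         sketch_xml_encoding is hypothetical. It takes the first quoted
         value that follows the word "encoding", wherever that occurs.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: pull the first quoted
# encoding pseudo-attribute out of an XML or text declaration.
sub sketch_xml_encoding {
    my ($declaration) = @_;
    # Must start with "<?xml" followed by whitespace; the value must be quoted.
    return
        unless $declaration =~ /\A<\?xml\s.*?encoding\s*=\s*(["'])(.*?)\1/s;
    my $value = $2;
    $value =~ s/\A\s+|\s+\z//g;    # trim leading and trailing whitespace
    $value =~ s/\s+/ /g;           # collapse internal runs to one space
    return length($value) ? $value : ();
}

my $simple = sketch_xml_encoding(q{<?xml version='1.0' encoding='utf-8'?>});
my $spaced = sketch_xml_encoding(q{<?xml encoding=' a    b '?>});
print "$simple\n$spaced\n";
```

         Like the routine it imitates, the sketch performs no
         well-formedness checks, so it also reports the 'a' buried inside
         another attribute value in the last row of the table.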

       xml_declaration_from_octets($octets [, %options])
         Attempts to find a ">" character in the byte string $octets using the
         encodings in $options{encodings} and, upon success, attempts to find
         a preceding "<" character. Returns all the strings found this way,
         ordered by number of successful matches, in list context and the best
         match in scalar context. Should probably be combined with the only
         user of this routine, encoding_from_xml_declaration... You can modify
         the list of suspected encodings using $options{encodings}.

       encoding_from_first_chars($octets [, %options])
         Assuming that documents start with "<" optionally preceded by
         whitespace characters, encoding_from_first_chars attempts to
         determine an encoding by matching $octets against something like
         /^[@{$options{whitespace}}]*</ in the various suspected
         $options{encodings}.

         This is useful to distinguish e.g. UTF-16LE from UTF-8 if the byte
         string starts with neither a byte order mark nor an XML declaration
         (e.g. if the document is an HTML document). It yields at least a base
         encoding which can be used to decode enough of the document to find
         <meta> elements using encoding_from_meta_element.

         $options{whitespace} defaults to qw/CR LF SP TB/. Returns nothing if
         unsuccessful. Returns the matching encodings in order of the number
         of octets matched in list context and the best match in scalar
         context.

         Examples:

           +---------------+----------+---------------------+
           | String        | Encoding | Result              |
           +---------------+----------+---------------------+
           | '<!DOCTYPE '  | UTF-16LE | UTF-16LE            |
           | ' <!DOCTYPE ' | UTF-16LE | UTF-16LE            |
           | '...'         | UTF-16LE | undef               |
           | '...<'        | UTF-16LE | undef               |
           | '<'           | UTF-8    | ISO-8859-1 or UTF-8 |
           | "<!--\xF6-->" | UTF-8    | ISO-8859-1 or UTF-8 |
           +---------------+----------+---------------------+
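
         The matching idea behind encoding_from_first_chars can be sketched
         with the core Encode module. This is an illustration, not the
         module's implementation; the helper name first_char_candidates is
         an assumption. It encodes the whitespace characters and "<" in each
         candidate encoding and tests whether the input starts with optional
         encoded whitespace followed by an encoded "<".

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not the module's code: return the candidate encodings
# in which $octets starts with optional whitespace and then "<".
sub first_char_candidates {
    my ($octets, @encodings) = @_;
    my %matched;
    for my $name (@encodings) {
        my $lt = quotemeta(encode($name, '<'));
        my $ws = join '|', map { quotemeta(encode($name, $_)) } "\r", "\n", " ", "\t";
        $matched{$name} = length($1) if $octets =~ /\A((?:$ws)*$lt)/;
    }
    # Most octets matched first; ties are broken by name to keep the order stable.
    return sort { $matched{$b} <=> $matched{$a} or $a cmp $b } keys %matched;
}

my @defaults = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);
my @guess    = first_char_candidates(encode('UTF-16LE', ' <!DOCTYPE '), @defaults);
print "$guess[0]\n";
```

         A single "<" octet cannot distinguish ISO-8859-1 from UTF-8, so the
         sketch returns both for that input, as in the last rows of the
         table.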

       encoding_from_meta_element($octets, $encname [, %options])
         Attempts to find <meta> elements in the document using HTML::Parser.
         It attempts to decode chunks of the byte string to characters using
         $encname before passing the data to HTML::Parser. An optional
         %options hash can be provided, which is passed to the HTML::Parser
         constructor. It stops processing the document when it encounters

           * </head>
           * encoding errors
           * the end of the input
           * ... (see TODO)

         If relevant <meta> elements, i.e. something like

           <meta http-equiv=Content-Type content='...'>

         are found, it uses encoding_from_content_type to extract the charset
         parameter. It returns all such encodings it could find, in document
         order, in list context or the first encoding in scalar context (it
         currently looks for others regardless of calling context), or nothing
         if that fails for some reason.

         Note that there are many edge cases where this does not yield
         "proper" results, depending on the capabilities of the HTML::Parser
         version and the options you pass to it. For example,

           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
             <!ENTITY content_type "text/html;charset=utf-8">
           ]>
           <meta http-equiv="Content-Type" content="&content_type;">
           <title></title>
           <p>...</p>

         This would likely not detect the "utf-8" value if HTML::Parser does
         not resolve the entity. This should, however, only be a concern for
         documents specifically crafted to break the encoding detection.

       encoding_from_xml_document($octets [, %options])
         Uses encoding_from_byte_order_mark to detect the encoding using a
         byte order mark in the byte string and returns the return value of
         that routine if it succeeds. Otherwise it uses
         xml_declaration_from_octets and encoding_from_xml_declaration and
         returns the encoding for which the latter routine found the most
         matches in scalar context, and all encodings ordered by number of
         occurrences in list context. It does not return a value if neither a
         byte order mark nor an in-band declaration declares a character
         encoding.

         Examples:

           +----------------------------+----------+-----------+----------+
           | Input                      | Encoding | Encodings | Result   |
           +----------------------------+----------+-----------+----------+
           | "<?xml?>"                  | UTF-16   | default   | UTF-16BE |
           | "<?xml?>"                  | UTF-16LE | default   | undef    |
           | "<?xml encoding='utf-8'?>" | UTF-16LE | default   | utf-8    |
           | "<?xml encoding='utf-8'?>" | UTF-16   | default   | UTF-16BE |
           | "<?xml encoding='cp37'?>"  | CP37     | default   | undef    |
           | "<?xml encoding='cp37'?>"  | CP37     | CP37      | cp37     |
           +----------------------------+----------+-----------+----------+

         Lacking a return value from this routine and higher-level protocol
         information (such as protocol encoding defaults), processors would
         be required to assume that the document is UTF-8 encoded.

         Note, however, that the return value depends on the set of suspected
         encodings you pass to it. For example, by default, EBCDIC encodings
         are not considered, and thus for

           <?xml version='1.0' encoding='cp37'?>

         this routine would return the undefined value. You can modify the
         list of suspected encodings using $options{encodings}.

       encoding_from_html_document($octets [, %options])
         Uses encoding_from_xml_document and encoding_from_meta_element to
         determine the encoding of HTML documents. If $options{xhtml} is set
         to a false value, it uses encoding_from_byte_order_mark and
         encoding_from_meta_element instead. The xhtml option is on by
         default. $options{encodings} can be used to modify the suspected
         encodings, and $options{parser_options} can be used to modify the
         HTML::Parser options in encoding_from_meta_element (see the relevant
         documentation).

         Returns nothing if no declaration could be found, the winning
         declaration in scalar context, and a list of encoding source and
         encoding name in list context; see ENCODING SOURCES.

         ...

         Other problems arise from differences between HTML and XHTML syntax
         and encoding detection rules. For example, the input could be

           Content-Type: text/html

           <?xml version='1.0' encoding='utf-8'?>
           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
           "http://www.w3.org/TR/html4/strict.dtd">
           <meta http-equiv = "Content-Type"
                    content = "text/html;charset=iso-8859-2">
           <title></title>
           <p>...</p>

         This is a perfectly legal HTML 4.01 document, and implementations
         might be expected to consider the document ISO-8859-2 encoded, as
         XML rules for encoding detection do not apply to HTML documents.
         This module attempts to avoid deciding which rules apply to a
         specific document and would thus by default return 'utf-8' for this
         input.

         On the other hand, if the input omits the encoding declaration,

           Content-Type: text/html

           <?xml version='1.0'?>
           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
           "http://www.w3.org/TR/html4/strict.dtd">
           <meta http-equiv = "Content-Type"
                    content = "text/html;charset=iso-8859-2">
           <title></title>
           <p>...</p>

         it would return 'iso-8859-2'. Similar problems arise from other
         differences between HTML and XHTML. For example, consider

           Content-Type: text/html

           <?foo >
           <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
               "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
           <html ...
           ?>
           ...
           <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
           ...

         If this is processed using HTML rules, the first > ends the
         processing instruction and the XHTML document type declaration is
         the relevant declaration for the document. If it is processed using
         XHTML rules, the ?> ends the processing instruction and the HTML
         document type declaration is the relevant declaration.

         In other words, an application would need to assume a certain
         character encoding (family) to process enough of the document to
         determine whether it is XHTML or HTML, and the result of this
         detection would depend on which processing rules are assumed in
         order to process it. It is thus in essence not possible to write a
         "perfect" detection algorithm, which is why this routine attempts to
         avoid making any decisions on this matter.

       encoding_from_http_message($message [, %options])
         Determines the encoding of HTML / XML / XHTML documents enclosed in
         an HTTP message. $message is an object compatible with HTTP::Message,
         e.g. an HTTP::Response object. %options is a hash with the following
         possible entries:

         encodings
           Array reference of suspected character encodings, defaults to
           $HTML::Encoding::DEFAULT_ENCODINGS.

         is_html
           Regular expression matched against the content_type of the message
           to determine whether to use HTML rules for the entity body,
           defaults to "qr{^text/html$}i".

         is_xml
           Regular expression matched against the content_type of the message
           to determine whether to use XML rules for the entity body, defaults
           to "qr{^.+/(?:.+\+)?xml$}i".

         is_text_xml
           Regular expression matched against the content_type of the message
           to determine whether to use text/xml rules for the message,
           defaults to "qr{^text/(?:.+\+)?xml$}i". This will only be checked
           if is_xml matches as well.

         html_default
           Default encoding for documents determined (by is_html) as HTML,
           defaults to "ISO-8859-1".

         xml_default
           Default encoding for documents determined (by is_xml) as XML,
           defaults to "UTF-8".

         text_xml_default
           Default encoding for documents determined (by is_text_xml) as
           text/xml, defaults to "undef", in which case the default is
           ignored. Set this to "US-ASCII" if desired: by default this module
           is inconsistent with RFC 3023, which requires that "US-ASCII" is
           assumed for text/xml documents without a charset parameter in the
           HTTP header.

           This requirement is inconsistent with RFC 2616 (HTTP/1.1), which
           requires assuming "ISO-8859-1"; it has been widely ignored and is
           thus disabled by default.

         xhtml
           Whether the routine should look for an encoding declaration in the
           XML declaration of the document (if any), defaults to 1.

         default
           Whether the relevant default value should be returned when no other
           information can be determined, defaults to 1.

         This routine is further possibly inconsistent with XML MIME types
         that differ in other ways from application/xml, for example if the
         MIME type does not allow for a charset parameter, in which case
         applications might be expected to ignore the charset parameter if
         erroneously provided.
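
         A hedged usage sketch of the options above. The option values and
         the hand-built response are assumptions chosen for illustration;
         in practice $message would come from LWP::UserAgent. Both modules
         are loaded at run time so the sketch still runs where they are not
         installed.

```perl
use strict;
use warnings;

# Both modules are loaded at run time so the sketch exits cleanly without them.
my $have_modules = eval { require HTML::Encoding; require HTTP::Response; 1 };

# Options documented above; the particular values are illustrative choices.
my %options = (
    encodings        => [qw(UTF-8 ISO-8859-1 UTF-16LE UTF-16BE)],
    text_xml_default => 'US-ASCII',    # opt in to the RFC 3023 default
    default          => 1,             # fall back to the per-type default
);

if ($have_modules) {
    # Hand-built response standing in for a real LWP::UserAgent result.
    my $message = HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type' => 'text/xml' ],
        "<?xml version='1.0'?><doc/>",
    );
    # List context yields the encoding source and the encoding name.
    my ($source, $name)
        = HTML::Encoding::encoding_from_http_message($message, %options);
    printf "source=%s name=%s\n", $source // 'none', $name // 'none';
}
```

         Checking the reported source (see ENCODING SOURCES) lets a caller
         tell an explicit declaration apart from a fallback default.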

EBCDIC SUPPORT

       By default, this module does not support EBCDIC encodings. To enable
       support for EBCDIC encodings, you can either change the
       $HTML::Encoding::DEFAULT_ENCODINGS array reference or pass the
       encodings to the routines you use through the encodings option, for
       example

         my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
         my $enc = encoding_from_xml_document($doc, encodings => \@try);

       Note that there are some subtle differences between various EBCDIC
       encodings, for example "!" is mapped to 0x5A in "posix-bc" and to 0x4F
       in "cp500"; these differences might affect processing in as yet
       undetermined ways.
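
       The difference mentioned above can be inspected with the core Encode
       module, assuming the local Encode build ships its EBCDIC tables (it
       normally does). The helper name octet_for is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the single octet that $char maps to in encoding $name.
sub octet_for {
    my ($name, $char) = @_;
    return ord(encode($name, $char));
}

for my $name (qw(posix-bc cp500)) {
    printf "%-8s '!' => 0x%02X\n", $name, octet_for($name, '!');
}
```

       A document that is decoded with the wrong EBCDIC variant would thus
       silently turn some punctuation into different characters.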

TODO

         * bundle with test suite
         * optimize some routines to give up once successful
         * avoid transcoding for HTML::Parser if e.g. ISO-8859-1
         * consider adding an "HTML5" mode of operation?

AUTHOR / COPYRIGHT / LICENSE

         Copyright (c) 2004-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
         This module is licensed under the same terms as Perl itself.

perl v5.38.0                      2023-07-20                 HTML::Encoding(3)