1HTML::Encoding(3) User Contributed Perl Documentation HTML::Encoding(3)
2
3
4
6 HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
7
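  use HTML::Encoding 'encoding_from_http_message';
  use LWP::UserAgent;
  use Encode;

  # Fetch the document; $resp is an HTTP::Response object.
  my $resp = LWP::UserAgent->new->get('http://www.example.org');
  # Determine the encoding name (scalar context).
  my $enco = encoding_from_http_message($resp);
  # Decode the raw octets into a Perl character string.
  my $utf8 = decode($enco => $resp->content);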
16
The interface and implementation are guaranteed to change before this
module reaches version 1.00! Please send feedback to the author of this
module.
21
23 HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
24 documents...
25
27 Most routines need to know some suspected character encodings which can
28 be provided through the "encodings" option. This option always defaults
29 to the $HTML::Encoding::DEFAULT_ENCODINGS array reference which means
30 the following encodings are considered by default:
31
32 * ISO-8859-1
33 * UTF-16LE
34 * UTF-16BE
35 * UTF-32LE
36 * UTF-32BE
37 * UTF-8
38
If you change the values or pass custom values to the routines, note
that Encode must support them in order for this module to work
correctly.
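
Whether Encode supports a given name can be checked before the list is
passed on. The following is a minimal sketch that assumes only the core
Encode module; the helper name supported_encodings is hypothetical and
not part of HTML::Encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: keep only the names that Encode can resolve.
# Encode::find_encoding returns undef for names it does not know.
sub supported_encodings {
    my @names = @_;
    return grep { defined Encode::find_encoding($_) } @names;
}

my @try = supported_encodings(qw/UTF-8 UTF-16LE ISO-8859-1 no-such-encoding/);
print join(', ', @try), "\n";
```

The filtered list could then be handed to a routine through the
"encodings" option, e.g. encodings => \@try.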
42
"encoding_from_xml_document", "encoding_from_html_document", and
"encoding_from_http_message" return the encoding source and the
encoding name in list context. Possible encoding sources are
47
48 * protocol (Content-Type: text/html;charset=encoding)
49 * bom (leading U+FEFF)
50 * xml (<?xml version='1.0' encoding='encoding'?>)
51 * meta (<meta http-equiv=...)
52 * default (default fallback value)
53 * protocol_default (protocol default)
54
56 Routines exported by this module at user option. By default, nothing is
57 exported.
58
59 encoding_from_content_type($content_type)
Takes a byte string and uses HTTP::Headers::Util to extract the
charset parameter from the "Content-Type" header value. Returns its
value, or "undef" (an empty list in list context) if there is no such
value. Only the first component is examined (HTTP/1.1 allows only one
component). Backslash escapes in strings are unescaped, leading and
trailing quote marks and white-space characters are removed, and all
white-space is collapsed to a single space. Empty charset values are
ignored and no case folding is performed.
69
70 Examples:
71
72 +-----------------------------------------+-----------+
73 ⎪ encoding_from_content_type(...) ⎪ returns ⎪
74 +-----------------------------------------+-----------+
75 ⎪ "text/html" ⎪ undef ⎪
76 ⎪ "text/html,text/plain;charset=utf-8" ⎪ undef ⎪
77 ⎪ "text/html;charset=" ⎪ undef ⎪
78 ⎪ "text/html;charset=\"\\u\\t\\f\\-\\8\"" ⎪ 'utf-8' ⎪
79 ⎪ "text/html;charset=utf\\-8" ⎪ 'utf\\-8' ⎪
80 ⎪ "text/html;charset='utf-8'" ⎪ 'utf-8' ⎪
81 ⎪ "text/html;charset=\" UTF-8 \"" ⎪ 'UTF-8' ⎪
82 +-----------------------------------------+-----------+
83
84 If you pass a string with the UTF-8 flag turned on the string will be
85 converted to bytes before it is passed to HTTP::Headers::Util. The
86 return value will thus never have the UTF-8 flag turned on (this
87 might change in future versions).
88
89 encoding_from_byte_order_mark($octets [, %options])
90 Takes a sequence of octets and attempts to read a byte order mark at
91 the beginning of the octet sequence. It will go through the list of
92 $options{encodings} or the list of default encodings if no encodings
93 are specified and match the beginning of the string against any byte
94 order mark octet sequence found.
95
The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could
be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
U+0000 character. It is also possible that $octets starts with
something that looks like a byte order mark but actually is not.
100
101 encoding_from_byte_order_mark sorts the list of possible encodings by
102 the length of their BOM octet sequence and returns in scalar context
103 only the encoding with the longest match, and all encodings ordered
104 by length of their BOM octet sequence in list context.
105
106 Examples:
107
108 +-------------------------+------------+-----------------------+
109 ⎪ Input ⎪ Encodings ⎪ Result ⎪
110 +-------------------------+------------+-----------------------+
111 ⎪ "\xFF\xFE\x00\x00" ⎪ default ⎪ qw(UTF-32LE) ⎪
112 ⎪ "\xFF\xFE\x00\x00" ⎪ default ⎪ qw(UTF-32LE UTF-16LE) ⎪
113 ⎪ "\xEF\xBB\xBF" ⎪ default ⎪ qw(UTF-8) ⎪
114 ⎪ "Hello World!" ⎪ default ⎪ undef ⎪
115 ⎪ "\xDD\x73\x66\x73" ⎪ default ⎪ undef ⎪
116 ⎪ "\xDD\x73\x66\x73" ⎪ UTF-EBCDIC ⎪ qw(UTF-EBCDIC) ⎪
117 ⎪ "\x2B\x2F\x76\x38\x2D" ⎪ default ⎪ undef ⎪
118 ⎪ "\x2B\x2F\x76\x38\x2D" ⎪ UTF-7 ⎪ qw(UTF-7) ⎪
119 +-------------------------+------------+-----------------------+
120
121 Note however that for UTF-7 it is in theory possible that the U+FEFF
122 combines with other characters in which case such detection would
123 fail, for example consider:
124
125 +--------------------------------------+-----------+-----------+
126 ⎪ Input ⎪ Encodings ⎪ Result ⎪
127 +--------------------------------------+-----------+-----------+
128 ⎪ "\x2B\x2F\x76\x38\x41\x39\x67\x2D" ⎪ default ⎪ undef ⎪
129 ⎪ "\x2B\x2F\x76\x38\x41\x39\x67\x2D" ⎪ UTF-7 ⎪ undef ⎪
130 +--------------------------------------+-----------+-----------+
131
This might change in future versions, although it is not very
relevant for most applications, as there should never be a need to use
UTF-7 in the encoding list for existing documents.
135
136 If no BOM can be found it returns "undef" in scalar context and an
137 empty list in list context. This routine should not be used with
138 strings with the UTF-8 flag turned on.
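
The longest-match ordering described for this routine can be sketched
in plain Perl. This is an illustration under stated assumptions, not
the module's implementation: the %BOM table covers only the default
encodings that have a byte order mark, and sketch_bom_match is a
hypothetical helper.

```perl
use strict;
use warnings;

# BOM octet sequences for the default encodings that have one.
my %BOM = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\x00\x00",
    'UTF-32BE' => "\x00\x00\xFE\xFF",
);

# Hypothetical helper: list context yields every encoding whose BOM is
# a prefix of $octets, longest BOM first; scalar context yields only
# the longest match, or undef if nothing matched.
sub sketch_bom_match {
    my ($octets) = @_;
    my @hits = sort { length($BOM{$b}) <=> length($BOM{$a}) or $a cmp $b }
               grep { index($octets, $BOM{$_}) == 0 }
               keys %BOM;
    return wantarray ? @hits : $hits[0];
}

my @all  = sketch_bom_match("\xFF\xFE\x00\x00");
my $best = sketch_bom_match("\xFF\xFE\x00\x00");
print "list: @all\n";
print "scalar: $best\n";
```

The ambiguous input from the examples table matches both UTF-32LE and
UTF-16LE here, and the scalar call keeps only the longer match.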
139
140 encoding_from_xml_declaration($declaration)
Attempts to extract the value of the encoding pseudo-attribute in an
XML declaration or text declaration in the character string
$declaration. If there does not appear to be such a value it returns
nothing. This would typically be used with the return values of
xml_declaration_from_octets. Normalizes white-space like
encoding_from_content_type.
147
148 Examples:
149
150 +-------------------------------------------+---------+
151 ⎪ encoding_from_xml_declaration(...) ⎪ Result ⎪
152 +-------------------------------------------+---------+
153 ⎪ "<?xml version='1.0' encoding='utf-8'?>" ⎪ 'utf-8' ⎪
154 ⎪ "<?xml encoding='utf-8'?>" ⎪ 'utf-8' ⎪
155 ⎪ "<?xml encoding=\"utf-8\"?>" ⎪ 'utf-8' ⎪
156 ⎪ "<?xml foo='bar' encoding='utf-8'?>" ⎪ 'utf-8' ⎪
157 ⎪ "<?xml encoding='a' encoding='b'?>" ⎪ 'a' ⎪
158 ⎪ "<?xml encoding=' a b '?>" ⎪ 'a b' ⎪
159 ⎪ "<?xml-stylesheet encoding='utf-8'?>" ⎪ undef ⎪
160 ⎪ " <?xml encoding='utf-8'?>" ⎪ undef ⎪
161 ⎪ "<?xml encoding =\x{2028}'utf-8'?>" ⎪ 'utf-8' ⎪
162 ⎪ "<?xml version='1.0' encoding=utf-8?>" ⎪ undef ⎪
163 ⎪ "<?xml x='encoding=\"a\"' encoding='b'?>" ⎪ 'a' ⎪
164 +-------------------------------------------+---------+
165
166 Note that encoding_from_xml_declaration() determines the encoding
167 even if the XML declaration is not well-formed or violates other
168 requirements of the relevant XML specification as long as it can find
169 an encoding pseudo-attribute in the provided string. This means XML
170 processors must apply further checks to determine whether the entity
171 is well-formed, etc.
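
The lenient behaviour shown in the examples table can be approximated
with two regular expressions. This is a rough sketch, assuming a simple
pattern match is sufficient; it is not the module's implementation, and
sketch_xml_decl_encoding is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first quoted encoding pseudo-attribute
# in a string that starts with "<?xml" followed by white-space.
sub sketch_xml_decl_encoding {
    my ($decl) = @_;

    # Must begin with "<?xml" and white-space; this rejects both
    # "<?xml-stylesheet ...?>" and leading white-space.
    return unless $decl =~ /^<\?xml\s/;

    # First occurrence of encoding = '...' or encoding = "...".
    return unless $decl =~ /encoding\s*=\s*(["'])(.*?)\1/s;
    my $value = $2;

    # Trim and collapse white-space, as the routine is described to do.
    $value =~ s/^\s+|\s+$//g;
    $value =~ s/\s+/ /g;

    return length($value) ? $value : ();
}

for my $decl (q{<?xml version='1.0' encoding='utf-8'?>},
              q{<?xml encoding=' a b '?>},
              q{<?xml-stylesheet encoding='utf-8'?>}) {
    my $enc = sketch_xml_decl_encoding($decl);
    printf "%s => %s\n", $decl, defined $enc ? "'$enc'" : 'undef';
}
```

Because no well-formedness check is made, a caller would still need to
validate the declaration separately.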
172
173 xml_declaration_from_octets($octets [, %options])
Attempts to find a ">" character in the byte string $octets using the
encodings in $options{encodings} and upon success attempts to find a
preceding "<" character. Returns all the strings found this way,
ordered by number of successful matches, in list context and the best
match in scalar context. Should probably be combined with the only
user of this routine, encoding_from_xml_declaration. You can modify
the list of suspected encodings using $options{encodings}.
181
182 encoding_from_first_chars($octets [, %options])
Assuming that documents start with "<" optionally preceded by
whitespace characters, encoding_from_first_chars attempts to determine
an encoding by matching $octets against something like
/^[@{$options{whitespace}}]*</ in the various suspected
$options{encodings}.

This is useful to distinguish e.g. UTF-16LE from UTF-8 if the byte
string starts with neither a byte order mark nor an XML declaration
(e.g. if the document is an HTML document). It yields at least a base
encoding which can be used to decode enough of the document to find
<meta> elements using encoding_from_meta_element.
$options{whitespace} defaults to qw/CR LF SP TB/. Returns the matching
encodings in order of the number of octets matched in list context and
the best match in scalar context. Returns nothing if unsuccessful.
197
198 Examples:
199
200 +---------------+----------+---------------------+
201 ⎪ String ⎪ Encoding ⎪ Result ⎪
202 +---------------+----------+---------------------+
203 ⎪ '<!DOCTYPE ' ⎪ UTF-16LE ⎪ UTF-16LE ⎪
204 ⎪ ' <!DOCTYPE ' ⎪ UTF-16LE ⎪ UTF-16LE ⎪
205 ⎪ '...' ⎪ UTF-16LE ⎪ undef ⎪
206 ⎪ '...<' ⎪ UTF-16LE ⎪ undef ⎪
207 ⎪ '<' ⎪ UTF-8 ⎪ ISO-8859-1 or UTF-8 ⎪
208 ⎪ "<!--\xF6-->" ⎪ UTF-8 ⎪ ISO-8859-1 or UTF-8 ⎪
209 +---------------+----------+---------------------+
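
The matching idea behind this routine can be sketched with the core
Encode module. This is a simplified illustration, not the module's
code: sketch_first_chars is a hypothetical helper, and the white-space
set is fixed to CR, LF, SP and TAB.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Default candidate list as documented for this module.
my @default = qw/ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8/;

# Hypothetical helper: for each candidate encoding, skip encoded
# white-space at the start of $octets and then require an encoded "<".
sub sketch_first_chars {
    my ($octets, @encodings) = @_;
    my %matched;    # encoding name => number of octets matched

    for my $enc (@encodings) {
        my @space = map { encode($enc, $_) } ("\r", "\n", " ", "\t");
        my $lt    = encode($enc, '<');
        my $pos   = 0;

        # Keep consuming white-space units until none matches.
        my $advanced = 1;
        while ($advanced) {
            $advanced = 0;
            for my $ws (@space) {
                next unless substr($octets, $pos, length($ws)) eq $ws;
                $pos += length($ws);
                $advanced = 1;
            }
        }

        $matched{$enc} = $pos + length($lt)
            if substr($octets, $pos, length($lt)) eq $lt;
    }

    # Most octets matched first; ties are broken by name.
    my @hits = sort { $matched{$b} <=> $matched{$a} or $a cmp $b }
               keys %matched;
    return wantarray ? @hits : $hits[0];
}

my $octets = encode('UTF-16LE', ' <!DOCTYPE ');
print join(' ', sketch_first_chars($octets, @default)), "\n";
```

A single "<" octet matches both ISO-8859-1 and UTF-8 in this sketch,
which mirrors the "ISO-8859-1 or UTF-8" rows in the table.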
210
211 encoding_from_meta_element($octets, $encname [, %options])
212 Attempts to find <meta> elements in the document using HTML::Parser.
213 It will attempt to decode chunks of the byte string using $encname to
214 characters before passing the data to HTML::Parser. An optional
215 %options hash can be provided which will be passed to the
216 HTML::Parser constructor. It will stop processing the document if it
217 encounters
218
219 * </head>
220 * encoding errors
221 * the end of the input
222 * ... (see todo)
223
224 If relevant <meta> elements, i.e. something like
225
226 <meta http-equiv=Content-Type content='...'>
227
228 are found, uses encoding_from_content_type to extract the charset
229 parameter. It returns all such encodings it could find in document
230 order in list context or the first encoding in scalar context (it
231 will currently look for others regardless of calling context) or
232 nothing if that fails for some reason.
233
Note that there are many edge cases where this does not yield
"proper" results, depending on the capabilities of the HTML::Parser
version and the options you pass to it. For example,
237
238 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
239 <!ENTITY content_type "text/html;charset=utf-8">
240 ]>
241 <meta http-equiv="Content-Type" content="&content_type;">
242 <title></title>
243 <p>...</p>
244
245 This would likely not detect the "utf-8" value if HTML::Parser does
246 not resolve the entity. This should however only be a concern for
247 documents specifically crafted to break the encoding detection.
248
encoding_from_xml_document($octets [, %options])
Uses encoding_from_byte_order_mark to detect the encoding using a
byte order mark in the byte string and returns the return value of
that routine if it succeeds. Otherwise uses
xml_declaration_from_octets and encoding_from_xml_declaration and
returns the encoding for which the latter routine found most matches
in scalar context, and all encodings ordered by number of occurrences
in list context. It does not return a value if neither a byte order
mark nor in-band declarations declare a character encoding.
258
259 Examples:
260
261 +----------------------------+----------+-----------+----------+
262 ⎪ Input ⎪ Encoding ⎪ Encodings ⎪ Result ⎪
263 +----------------------------+----------+-----------+----------+
264 ⎪ "<?xml?>" ⎪ UTF-16 ⎪ default ⎪ UTF-16BE ⎪
265 ⎪ "<?xml?>" ⎪ UTF-16LE ⎪ default ⎪ undef ⎪
266 ⎪ "<?xml encoding='utf-8'?>" ⎪ UTF-16LE ⎪ default ⎪ utf-8 ⎪
267 ⎪ "<?xml encoding='utf-8'?>" ⎪ UTF-16 ⎪ default ⎪ UTF-16BE ⎪
268 ⎪ "<?xml encoding='cp37'?>" ⎪ CP37 ⎪ default ⎪ undef ⎪
269 ⎪ "<?xml encoding='cp37'?>" ⎪ CP37 ⎪ CP37 ⎪ cp37 ⎪
270 +----------------------------+----------+-----------+----------+
271
272 Lacking a return value from this routine and higher-level protocol
273 information (such as protocol encoding defaults) processors would be
274 required to assume that the document is UTF-8 encoded.
275
276 Note however that the return value depends on the set of suspected
277 encodings you pass to it. For example, by default, EBCDIC encodings
278 would not be considered and thus for
279
280 <?xml version='1.0' encoding='cp37'?>
281
282 this routine would return the undefined value. You can modify the
283 list of suspected encodings using $options{encodings}.
284
encoding_from_html_document($octets [, %options])
Uses encoding_from_xml_document and encoding_from_meta_element to
determine the encoding of HTML documents. If $options{xhtml} is set
to a false value, uses encoding_from_byte_order_mark and
encoding_from_meta_element instead. The xhtml option is on by default.
$options{encodings} can be used to modify the suspected encodings and
$options{parser_options} can be used to modify the HTML::Parser
options in encoding_from_meta_element (see the relevant
documentation).

Returns nothing if no declaration could be found, the winning
declaration in scalar context and a list of encoding source and
encoding name in list context, see ENCODING SOURCES.
298
299 ...
300
301 Other problems arise from differences between HTML and XHTML syntax
302 and encoding detection rules, for example, the input could be
303
304 Content-Type: text/html
305
306 <?xml version='1.0' encoding='utf-8'?>
307 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
308 "http://www.w3.org/TR/html4/strict.dtd">
309 <meta http-equiv = "Content-Type"
310 content = "text/html;charset=iso-8859-2">
311 <title></title>
312 <p>...</p>
313
314 This is a perfectly legal HTML 4.01 document and implementations
315 might be expected to consider the document ISO-8859-2 encoded as XML
316 rules for encoding detection do not apply to HTML documents. This
317 module attempts to avoid making decisions which rules apply for a
318 specific document and would thus by default return 'utf-8' for this
319 input.
320
321 On the other hand, if the input omits the encoding declaration,
322
323 Content-Type: text/html
324
325 <?xml version='1.0'?>
326 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
327 "http://www.w3.org/TR/html4/strict.dtd">
328 <meta http-equiv = "Content-Type"
329 content = "text/html;charset=iso-8859-2">
330 <title></title>
331 <p>...</p>
332
333 It would return 'iso-8859-2'. Similar problems would arise from other
334 differences between HTML and XHTML, for example consider
335
336 Content-Type: text/html
337
338 <?foo >
339 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
340 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
341 <html ...
342 ?>
343 ...
344 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
345 ...
346
If this is processed using HTML rules, the first > will end the
processing instruction and the XHTML document type declaration would
be the relevant declaration for the document. If it is processed using
XHTML rules, the ?> will end the processing instruction and the HTML
document type declaration would be the relevant declaration.

In other words, an application would need to assume a certain
character encoding (family) to process enough of the document to
determine whether it is XHTML or HTML, and the result of this
detection would depend on which processing rules are assumed in order
to process it. It is thus in essence not possible to write a "perfect"
detection algorithm, which is why this routine attempts to avoid
making any decisions on this matter.
360
361 encoding_from_http_message($message [, %options])
Determines the encoding of HTML / XML / XHTML documents enclosed in
an HTTP message. $message is an object compatible with HTTP::Message,
e.g. an HTTP::Response object. %options is a hash with the following
possible entries:
366
367 encodings
Array reference of suspected character encodings, defaults to
$HTML::Encoding::DEFAULT_ENCODINGS.
370
371 is_html
372 Regular expression matched against the content_type of the message
373 to determine whether to use HTML rules for the entity body,
374 defaults to "qr{^text/html$}i".
375
376 is_xml
377 Regular expression matched against the content_type of the message
378 to determine whether to use XML rules for the entity body, defaults
379 to "qr{^.+/(?:.+\+)?xml$}i".
380
381 is_text_xml
Regular expression matched against the content_type of the message
to determine whether to use text/xml rules for the message,
defaults to "qr{^text/(?:.+\+)?xml$}i". This will only be checked
if is_xml matches, too.
386
387 html_default
388 Default encoding for documents determined (by is_html) as HTML,
389 defaults to "ISO-8859-1".
390
391 xml_default
392 Default encoding for documents determined (by is_xml) as XML,
393 defaults to "UTF-8".
394
395 text_xml_default
Default encoding for documents determined (by is_text_xml) as
text/xml, defaults to "undef" in which case the default is ignored.
Set this to "US-ASCII" if you want behavior consistent with RFC 3023,
which requires that "US-ASCII" is assumed for text/xml documents
without a charset parameter in the HTTP header.

That requirement is inconsistent with RFC 2616 (HTTP/1.1), which
requires assuming "ISO-8859-1"; it has been widely ignored and is
thus disabled by default.
406
407 xhtml
408 Whether the routine should look for an encoding declaration in the
409 XML declaration of the document (if any), defaults to 1.
410
411 default
412 Whether the relevant default value should be returned when no other
413 information can be determined, defaults to 1.
414
This is further possibly inconsistent with XML MIME types that differ
in other ways from application/xml, for example if the MIME type does
not allow for a charset parameter, in which case applications might be
expected to ignore the charset parameter if erroneously provided.
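
How the default is_html, is_xml and is_text_xml patterns classify a
media type can be tried out in isolation. The sketch below copies the
documented default patterns; the helper sketch_rules_for and the order
in which it tests XML before HTML are assumptions made for
illustration, not a description of the module's internals.

```perl
use strict;
use warnings;

# Default patterns as documented for encoding_from_http_message.
my $is_html     = qr{^text/html$}i;
my $is_xml      = qr{^.+/(?:.+\+)?xml$}i;
my $is_text_xml = qr{^text/(?:.+\+)?xml$}i;

# Hypothetical helper: name the rule set a media type would select.
sub sketch_rules_for {
    my ($type) = @_;
    if ($type =~ $is_xml) {
        # is_text_xml is only consulted when is_xml matched, too.
        return $type =~ $is_text_xml ? 'text/xml' : 'xml';
    }
    return 'html' if $type =~ $is_html;
    return 'other';
}

for my $type (qw{text/html application/xhtml+xml text/xml
                 image/svg+xml text/plain}) {
    printf "%-22s %s\n", $type, sketch_rules_for($type);
}
```

Custom patterns could be supplied through the is_html, is_xml and
is_text_xml options if other media types need to be covered.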
419
By default, this module does not support EBCDIC encodings. To enable
support for EBCDIC encodings you can either change the
$HTML::Encoding::DEFAULT_ENCODINGS array reference or pass the
encodings to the routines you use via the encodings option, for
example

  my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
  my $enc = encoding_from_xml_document($doc, encodings => \@try);
428
Note that there are some subtle differences between various EBCDIC
encodings, for example "!" is mapped to 0x5A in "posix-bc" and to 0x4F
in "cp500"; these differences might affect processing in as yet
undetermined ways.
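
The difference mentioned for "!" can be inspected directly. This is a
small sketch that assumes the Encode installation includes the EBCDIC
encodings "cp500" and "posix-bc" (they are normally loaded on demand).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Print the octet that each EBCDIC variant assigns to "!".
for my $enc (qw/cp500 posix-bc/) {
    my $octet = encode($enc, '!');
    printf "%-8s 0x%02X\n", $enc, ord $octet;
}
```

The same loop can be used to compare other characters, such as "<" or
"?", before adding an EBCDIC encoding to the list of suspects.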
433
435 * bundle with test suite
436 * optimize some routines to give up once successful
437 * avoid transcoding for HTML::Parser if e.g. ISO-8859-1
438
440 * http://www.w3.org/TR/REC-xml/#charencoding
441 * http://www.w3.org/TR/REC-xml/#sec-guessing
442 * http://www.w3.org/TR/xml11/#charencoding
443 * http://www.w3.org/TR/xml11/#sec-guessing
444 * http://www.w3.org/TR/html4/charset.html#h-5.2.2
445 * http://www.w3.org/TR/xhtml1/#C_9
446 * http://www.ietf.org/rfc/rfc2616.txt
447 * http://www.ietf.org/rfc/rfc2854.txt
448 * http://www.ietf.org/rfc/rfc3023.txt
449 * perlunicode
450 * Encode
451 * HTML::Parser
452
454 Copyright (c) 2004 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
455 This module is licensed under the same terms as Perl itself.
456
457
458
459perl v5.8.8 2007-04-18 HTML::Encoding(3)