HTML::Encoding(3) User Contributed Perl Documentation HTML::Encoding(3)

HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents

use HTML::Encoding 'encoding_from_http_message';
use LWP::UserAgent;
use Encode;

my $resp = LWP::UserAgent->new->get('http://www.example.org');
my $enco = encoding_from_http_message($resp);
my $utf8 = decode($enco => $resp->content);

The interface and implementation are guaranteed to change before this
module reaches version 1.00! Please send feedback to the author of this
module.

HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
documents...

Most routines need a list of suspected character encodings, which can
be provided through the "encodings" option. This option always defaults
to the $HTML::Encoding::DEFAULT_ENCODINGS array reference, which means
the following encodings are considered by default:

* ISO-8859-1
* UTF-16LE
* UTF-16BE
* UTF-32LE
* UTF-32BE
* UTF-8

If you change the values or pass custom values to the routines, note
that Encode must support them in order for this module to work
correctly.
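
Because unsupported names would otherwise only show up later as failed
detection, it can help to vet a custom list up front. The following is a
minimal sketch and not part of HTML::Encoding: the helper name
unsupported_encodings is hypothetical, and it relies only on
Encode::find_encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of HTML::Encoding): returns the names
# from @names that Encode cannot resolve to an encoding object.
sub unsupported_encodings {
    my @names = @_;
    return grep { !defined Encode::find_encoding($_) } @names;
}

# Vet a custom list before passing it via the "encodings" option.
my @try     = qw(UTF-8 UTF-16LE ISO-8859-1 no-such-encoding);
my @missing = unsupported_encodings(@try);
print "Not supported by Encode: @missing\n" if @missing;
```

Names that pass this check can then be handed to the routines through
the "encodings" option.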

"encoding_from_xml_document", "encoding_from_html_document", and
"encoding_from_http_message" return the encoding source and the
encoding name in list context. Possible encoding sources are:

* protocol (Content-Type: text/html;charset=encoding)
* bom (leading U+FEFF)
* xml (<?xml version='1.0' encoding='encoding'?>)
* meta (<meta http-equiv=...)
* default (default fallback value)
* protocol_default (protocol default)

Routines are exported by this module on request. By default, nothing is
exported.

encoding_from_content_type($content_type)
Takes a byte string and uses HTTP::Headers::Util to extract the
charset parameter from the "Content-Type" header value. Returns its
value, or "undef" (an empty list in list context) if there is no such
value. Only the first component is examined (HTTP/1.1 allows only one
component). Backslash escapes in quoted strings are unescaped, leading
and trailing quote marks and white-space characters are removed, and
runs of white-space are collapsed to a single space. Empty charset
values are ignored and no case folding is performed.

Examples:

+-----------------------------------------+-----------+
| encoding_from_content_type(...)         | returns   |
+-----------------------------------------+-----------+
| "text/html"                             | undef     |
| "text/html,text/plain;charset=utf-8"    | undef     |
| "text/html;charset="                    | undef     |
| "text/html;charset=\"\\u\\t\\f\\-\\8\"" | 'utf-8'   |
| "text/html;charset=utf\\-8"             | 'utf\\-8' |
| "text/html;charset='utf-8'"             | 'utf-8'   |
| "text/html;charset=\" UTF-8 \""         | 'UTF-8'   |
+-----------------------------------------+-----------+

If you pass a string with the UTF-8 flag turned on the string will be
converted to bytes before it is passed to HTTP::Headers::Util. The
return value will thus never have the UTF-8 flag turned on (this
might change in future versions).
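
To make the normalization rules above concrete, here is a rough,
self-contained approximation in plain Perl. It is a sketch only: the
module itself delegates parsing to HTTP::Headers::Util, and the function
name charset_from_content_type is hypothetical. The naive comma split is
a simplification that does not protect commas inside quoted strings.

```perl
use strict;
use warnings;

# Hypothetical approximation of the documented behaviour; this is not
# the module's implementation.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;

    # Only the first comma-separated component is examined.
    my ($first) = split /,/, $value, 2;
    return unless defined $first;

    # The parameter value is either a quoted string or a bare token.
    return unless $first =~ /;\s*charset\s*=\s*(?:"((?:[^"\\]|\\.)*)"|([^;]*))/is;
    my ($quoted, $token) = ($1, $2);

    my $charset = defined $quoted ? $quoted : $token;
    $charset =~ s/\\(.)/$1/gs if defined $quoted;   # unescape quoted strings only
    $charset =~ s/^[\s"']+//;                       # strip leading quotes/white-space
    $charset =~ s/[\s"']+$//;                       # strip trailing quotes/white-space
    $charset =~ s/\s+/ /g;                          # collapse inner white-space

    return unless length $charset;                  # ignore empty values
    return $charset;                                # no case folding
}

my $charset = charset_from_content_type('text/html;charset=" UTF-8 "');
print "charset: $charset\n" if defined $charset;
```

The sketch reproduces the rows of the table above for these inputs, but
it should not be expected to agree with HTTP::Headers::Util in every
edge case.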

encoding_from_byte_order_mark($octets [, %options])
Takes a sequence of octets and attempts to read a byte order mark at
the beginning of the octet sequence. It goes through the list in
$options{encodings}, or the list of default encodings if no encodings
are specified, and matches the beginning of the string against the
byte order mark octet sequence of each encoding.

The result can be ambiguous. For example, qq(\xFF\xFE\x00\x00) could
be either a complete BOM in UTF-32LE or a UTF-16LE BOM followed by a
U+0000 character. It is also possible that $octets starts with
something that looks like a byte order mark but actually is not.

encoding_from_byte_order_mark sorts the list of possible encodings by
the length of their BOM octet sequence. In scalar context it returns
only the encoding with the longest match; in list context it returns
all encodings ordered by the length of their BOM octet sequence.

Examples (the first two rows show the scalar-context and list-context
results for the same input):

+-------------------------+------------+-----------------------+
| Input                   | Encodings  | Result                |
+-------------------------+------------+-----------------------+
| "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE)          |
| "\xFF\xFE\x00\x00"      | default    | qw(UTF-32LE UTF-16LE) |
| "\xEF\xBB\xBF"          | default    | qw(UTF-8)             |
| "Hello World!"          | default    | undef                 |
| "\xDD\x73\x66\x73"      | default    | undef                 |
| "\xDD\x73\x66\x73"      | UTF-EBCDIC | qw(UTF-EBCDIC)        |
| "\x2B\x2F\x76\x38\x2D"  | default    | undef                 |
| "\x2B\x2F\x76\x38\x2D"  | UTF-7      | qw(UTF-7)             |
+-------------------------+------------+-----------------------+

Note however that for UTF-7 it is in theory possible that the U+FEFF
combines with other characters in which case such detection would
fail, for example consider:

+--------------------------------------+-----------+-----------+
| Input                                | Encodings | Result    |
+--------------------------------------+-----------+-----------+
| "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | default   | undef     |
| "\x2B\x2F\x76\x38\x41\x39\x67\x2D"   | UTF-7     | undef     |
+--------------------------------------+-----------+-----------+

This might change in future versions, although this is not very
relevant for most applications, as there should never be a need to use
UTF-7 in the encoding list for existing documents.

If no BOM can be found, it returns "undef" in scalar context and an
empty list in list context. This routine should not be used with
strings with the UTF-8 flag turned on.
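
The longest-match ordering described above can be illustrated with
Encode alone. This is a sketch under stated assumptions, not the
module's code: bom_candidates and @DEFAULT are hypothetical names, and
the BOM of each encoding is derived by encoding U+FEFF, skipping
encodings (such as ISO-8859-1) that cannot represent it.

```perl
use strict;
use warnings;
use Encode ();

# Same names as the documented default list; the variable is illustrative.
my @DEFAULT = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);

# Hypothetical sketch: returns encodings whose BOM prefixes $octets,
# longest BOM first (list context) or only the longest (scalar context).
sub bom_candidates {
    my ($octets, @encodings) = @_;
    @encodings = @DEFAULT unless @encodings;

    my @hits;
    for my $name (@encodings) {
        # Fresh copy each time: encode() may modify its argument when a
        # CHECK value is given.
        my $char = "\x{FEFF}";
        my $bom  = eval { Encode::encode($name, $char, Encode::FB_CROAK()) };
        next unless defined $bom and length $bom;   # encoding has no U+FEFF
        push @hits, [ $name, length($bom) ] if index($octets, $bom) == 0;
    }

    my @sorted = map { $_->[0] } sort { $b->[1] <=> $a->[1] } @hits;
    return wantarray ? @sorted : $sorted[0];
}

my @all = bom_candidates("\xFF\xFE\x00\x00");
print "candidates: @all\n";
```

For the ambiguous input above both UTF-32LE and UTF-16LE match, and the
four-octet BOM sorts first.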

encoding_from_xml_declaration($declaration)
Attempts to extract the value of the encoding pseudo-attribute in an
XML declaration or text declaration in the character string
$declaration. If there does not appear to be such a value, it returns
nothing. This would typically be used with the return values of
xml_declaration_from_octets. Whitespace is normalized in the same way
as in encoding_from_content_type.

Examples:

+-------------------------------------------+---------+
| encoding_from_xml_declaration(...)        | Result  |
+-------------------------------------------+---------+
| "<?xml version='1.0' encoding='utf-8'?>"  | 'utf-8' |
| "<?xml encoding='utf-8'?>"                | 'utf-8' |
| "<?xml encoding=\"utf-8\"?>"              | 'utf-8' |
| "<?xml foo='bar' encoding='utf-8'?>"      | 'utf-8' |
| "<?xml encoding='a' encoding='b'?>"       | 'a'     |
| "<?xml encoding=' a b '?>"                | 'a b'   |
| "<?xml-stylesheet encoding='utf-8'?>"     | undef   |
| " <?xml encoding='utf-8'?>"               | undef   |
| "<?xml encoding =\x{2028}'utf-8'?>"       | 'utf-8' |
| "<?xml version='1.0' encoding=utf-8?>"    | undef   |
| "<?xml x='encoding=\"a\"' encoding='b'?>" | 'a'     |
+-------------------------------------------+---------+

Note that encoding_from_xml_declaration() determines the encoding
even if the XML declaration is not well-formed or violates other
requirements of the relevant XML specification as long as it can find
an encoding pseudo-attribute in the provided string. This means XML
processors must apply further checks to determine whether the entity
is well-formed, etc.
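
The lenient matching described above can be approximated with two
regular expressions. This is only a sketch consistent with the rows it
is tested against, not the module's implementation, and the name
xml_decl_encoding is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical approximation: take the first quoted value that follows
# "encoding=" in a string that starts with "<?xml" plus white-space.
sub xml_decl_encoding {
    my ($decl) = @_;
    return unless defined $decl and $decl =~ /^<\?xml\s/;

    # No well-formedness checks: the first quoted match wins.
    return unless $decl =~ /encoding\s*=\s*(["'])(.*?)\1/s;
    my $name = $2;

    $name =~ s/^\s+//;      # trim leading white-space
    $name =~ s/\s+$//;      # trim trailing white-space
    $name =~ s/\s+/ /g;     # collapse inner white-space

    return unless length $name;
    return $name;
}

my $enc = xml_decl_encoding(q{<?xml version='1.0' encoding='utf-8'?>});
print "declared encoding: $enc\n" if defined $enc;
```

Because no further checks are applied, a caller still has to validate
the declaration separately, as noted above.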

xml_declaration_from_octets($octets [, %options])
Attempts to find a ">" character in the byte string $octets using each
of the suspected encodings and, upon success, attempts to find a
preceding "<" character. Returns all the strings found this way, in
order of the number of successful matches, in list context and the
best match in scalar context. Should probably be combined with the
only user of this routine, encoding_from_xml_declaration. You can
modify the list of suspected encodings using $options{encodings}.

encoding_from_first_chars($octets [, %options])
Assuming that documents start with "<" optionally preceded by
whitespace characters, encoding_from_first_chars attempts to
determine an encoding by matching $octets against something like
/^[@{$options{whitespace}}]*</ in the various suspected
$options{encodings}.

This is useful to distinguish e.g. UTF-16LE from UTF-8 if the byte
string starts with neither a byte order mark nor an XML declaration
(e.g. if the document is an HTML document). It yields at least a base
encoding which can be used to decode enough of the document to find
<meta> elements using encoding_from_meta_element.

$options{whitespace} defaults to qw/CR LF SP TB/. Returns nothing if
unsuccessful. Returns the matching encodings in order of the number
of octets matched in list context and the best match in scalar
context.

Examples:

+---------------+----------+---------------------+
| String        | Encoding | Result              |
+---------------+----------+---------------------+
| '<!DOCTYPE '  | UTF-16LE | UTF-16LE            |
| ' <!DOCTYPE ' | UTF-16LE | UTF-16LE            |
| '...'         | UTF-16LE | undef               |
| '...<'        | UTF-16LE | undef               |
| '<'           | UTF-8    | ISO-8859-1 or UTF-8 |
| "<!--\xF6-->" | UTF-8    | ISO-8859-1 or UTF-8 |
+---------------+----------+---------------------+
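
The idea behind this routine can be sketched with Encode alone: encode
the white-space characters and "<" in each suspected encoding and see
which encoding lets the octets parse as "optional white-space, then <".
The helper name first_chars_candidates is hypothetical, and the sketch
assumes that each listed encoding uses a fixed width for these ASCII
characters, which holds for the default list.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sketch, not the module's implementation.
sub first_chars_candidates {
    my ($octets, @encodings) = @_;

    my @hits;
    for my $name (@encodings) {
        my $lt    = Encode::encode($name, '<');
        my $width = length $lt;

        # CR, LF, SP and TAB as they appear in this encoding.
        my %is_ws;
        $is_ws{ Encode::encode($name, $_) } = 1 for "\x0D", "\x0A", "\x20", "\x09";

        # Skip leading white-space one encoded character at a time.
        my $pos = 0;
        $pos += $width while $is_ws{ substr($octets, $pos, $width) };

        # Record the encoding and how many octets were consumed.
        push @hits, [ $name, $pos + $width ]
            if substr($octets, $pos, $width) eq $lt;
    }

    my @sorted = map { $_->[0] } sort { $b->[1] <=> $a->[1] } @hits;
    return wantarray ? @sorted : $sorted[0];
}

my @suspects = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8);
my $doc      = Encode::encode('UTF-16LE', ' <!DOCTYPE html>');
my $guess    = first_chars_candidates($doc, @suspects);
print "base encoding: $guess\n" if defined $guess;
```

As in the table, a lone "<" cannot distinguish ISO-8859-1 from UTF-8, so
both are returned for such input.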

encoding_from_meta_element($octets, $encname [, %options])
Attempts to find <meta> elements in the document using HTML::Parser.
It attempts to decode chunks of the byte string to characters using
$encname before passing the data to HTML::Parser. An optional
%options hash can be provided, which will be passed to the
HTML::Parser constructor. It stops processing the document when it
encounters

* </head>
* encoding errors
* the end of the input
* ... (see todo)

If relevant <meta> elements, i.e. something like

<meta http-equiv=Content-Type content='...'>

are found, it uses encoding_from_content_type to extract the charset
parameter. It returns all such encodings it could find, in document
order, in list context or the first encoding in scalar context (it
will currently look for others regardless of calling context), or
nothing if that fails for some reason.

Note that there are many edge cases where this does not yield
"proper" results, depending on the capabilities of the HTML::Parser
version and the options you pass to it. For example,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" [
<!ENTITY content_type "text/html;charset=utf-8">
]>
<meta http-equiv="Content-Type" content="&content_type;">
<title></title>
<p>...</p>

This would likely not detect the "utf-8" value if HTML::Parser does
not resolve the entity. This should however only be a concern for
documents specifically crafted to break the encoding detection.

encoding_from_xml_document($octets [, %options])
Uses encoding_from_byte_order_mark to detect the encoding using a
byte order mark in the byte string and returns the return value of
that routine if it succeeds. Otherwise, it uses
xml_declaration_from_octets and encoding_from_xml_declaration and
returns the encoding for which the latter routine found the most
matches in scalar context, and all encodings ordered by number of
occurrences in list context. It does not return a value if neither a
byte order mark nor an inbound declaration declares a character
encoding.

Examples:

+----------------------------+----------+-----------+----------+
| Input                      | Encoding | Encodings | Result   |
+----------------------------+----------+-----------+----------+
| "<?xml?>"                  | UTF-16   | default   | UTF-16BE |
| "<?xml?>"                  | UTF-16LE | default   | undef    |
| "<?xml encoding='utf-8'?>" | UTF-16LE | default   | utf-8    |
| "<?xml encoding='utf-8'?>" | UTF-16   | default   | UTF-16BE |
| "<?xml encoding='cp37'?>"  | CP37     | default   | undef    |
| "<?xml encoding='cp37'?>"  | CP37     | CP37      | cp37     |
+----------------------------+----------+-----------+----------+

Lacking a return value from this routine and higher-level protocol
information (such as protocol encoding defaults) processors would be
required to assume that the document is UTF-8 encoded.

Note however that the return value depends on the set of suspected
encodings you pass to it. For example, by default, EBCDIC encodings
would not be considered and thus for

<?xml version='1.0' encoding='cp37'?>

this routine would return the undefined value. You can modify the
list of suspected encodings using $options{encodings}.

encoding_from_html_document($octets [, %options])
Uses encoding_from_xml_document and encoding_from_meta_element to
determine the encoding of HTML documents. If $options{xhtml} is set
to a false value, it uses encoding_from_byte_order_mark and
encoding_from_meta_element to determine the encoding instead. The
xhtml option is on by default. $options{encodings} can be used to
modify the suspected encodings and $options{parser_options} can be
used to modify the HTML::Parser options in encoding_from_meta_element
(see the relevant documentation).

Returns nothing if no declaration could be found, the winning
declaration in scalar context, and a list of encoding source and
encoding name in list context (see the list of encoding sources
above).

...

Other problems arise from differences between HTML and XHTML syntax
and encoding detection rules, for example, the input could be

Content-Type: text/html

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>

This is a perfectly legal HTML 4.01 document and implementations
might be expected to consider the document ISO-8859-2 encoded as XML
rules for encoding detection do not apply to HTML documents. This
module attempts to avoid making decisions which rules apply for a
specific document and would thus by default return 'utf-8' for this
input.

On the other hand, if the input omits the encoding declaration,

Content-Type: text/html

<?xml version='1.0'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>

It would return 'iso-8859-2'. Similar problems would arise from other
differences between HTML and XHTML, for example consider

Content-Type: text/html

<?foo >
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html ...
?>
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
...

If this is processed using HTML rules, the first > will end the
processing instruction and the XHTML document type declaration would
be the relevant declaration for the document. If it is processed
using XHTML rules, the ?> will end the processing instruction and the
HTML document type declaration would be the relevant declaration.

In other words, an application would need to assume a certain
character encoding (family) to process enough of the document to
determine whether it is XHTML or HTML, and the result of this
detection would depend on which processing rules are assumed in order
to process it. It is thus in essence not possible to write a
"perfect" detection algorithm, which is why this routine attempts to
avoid making any decisions on this matter.

encoding_from_http_message($message [, %options])
Determines the encoding of HTML / XML / XHTML documents enclosed in
an HTTP message. $message is an object compatible with HTTP::Message,
e.g. an HTTP::Response object. %options is a hash with the following
possible entries:

encodings
Array reference of suspected character encodings, defaults to
$HTML::Encoding::DEFAULT_ENCODINGS.

is_html
Regular expression matched against the content_type of the message
to determine whether to use HTML rules for the entity body,
defaults to "qr{^text/html$}i".

is_xml
Regular expression matched against the content_type of the message
to determine whether to use XML rules for the entity body, defaults
to "qr{^.+/(?:.+\+)?xml$}i".

is_text_xml
Regular expression matched against the content_type of the message
to determine whether to use text/xml rules for the message,
defaults to "qr{^text/(?:.+\+)?xml$}i". This will only be checked
if is_xml matches as well.

html_default
Default encoding for documents determined (by is_html) as HTML,
defaults to "ISO-8859-1".

xml_default
Default encoding for documents determined (by is_xml) as XML,
defaults to "UTF-8".

text_xml_default
Default encoding for documents determined (by is_text_xml) as
text/xml, defaults to "undef" in which case the default is ignored.
This should be set to "US-ASCII" if desired, as this module is by
default inconsistent with RFC 3023, which requires that "US-ASCII" is
assumed for text/xml documents without a charset parameter in the
HTTP header.

This requirement is inconsistent with RFC 2616 (HTTP/1.1), which
requires assuming "ISO-8859-1". It has been widely ignored and is
thus disabled by default.

xhtml
Whether the routine should look for an encoding declaration in the
XML declaration of the document (if any), defaults to 1.

default
Whether the relevant default value should be returned when no other
information can be determined, defaults to 1.

This is further possibly inconsistent with XML MIME types that differ
in other ways from application/xml, for example if the MIME type does
not allow for a charset parameter, in which case applications might be
expected to ignore the charset parameter if erroneously provided.
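
How the documented regular expressions and defaults could interact is
sketched below. This is one plausible reading of the option
descriptions, not the module's code: protocol_default_for is a
hypothetical name, the input is assumed to be a bare media type without
parameters, and falling back to xml_default when text_xml_default is
undefined is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: picks a protocol default for a bare media type
# using the documented default patterns and default encodings.
sub protocol_default_for {
    my ($content_type, %opt) = @_;

    my $is_html     = $opt{is_html}     || qr{^text/html$}i;
    my $is_xml      = $opt{is_xml}      || qr{^.+/(?:.+\+)?xml$}i;
    my $is_text_xml = $opt{is_text_xml} || qr{^text/(?:.+\+)?xml$}i;

    my $html_default = exists $opt{html_default} ? $opt{html_default} : 'ISO-8859-1';
    my $xml_default  = exists $opt{xml_default}  ? $opt{xml_default}  : 'UTF-8';
    my $text_xml_default = $opt{text_xml_default};   # undef means "ignored"

    return $html_default if $content_type =~ $is_html;

    if ($content_type =~ $is_xml) {
        # is_text_xml is only consulted once is_xml has matched.
        return $text_xml_default
            if $content_type =~ $is_text_xml and defined $text_xml_default;
        return $xml_default;   # assumed fallback
    }

    return;   # neither HTML nor XML
}

for my $type (qw(text/html application/xhtml+xml text/xml image/png)) {
    my $default = protocol_default_for($type);
    printf "%-22s %s\n", $type, defined $default ? $default : '(none)';
}
```

Passing text_xml_default => 'US-ASCII' to the sketch selects the
RFC 3023 behaviour for text/xml types only.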

By default, this module does not support EBCDIC encodings. To enable
support for EBCDIC encodings you can either change the
$HTML::Encoding::DEFAULT_ENCODINGS array reference or pass the
encodings to the routines you use using the encodings option, for
example

my @try = qw/UTF-8 UTF-16LE cp500 posix-bc .../;
my $enc = encoding_from_xml_document($doc, encodings => \@try);

Note that there are some subtle differences between various EBCDIC
encodings, for example "!" is mapped to 0x5A in "posix-bc" and to 0x4F
in "cp500"; these differences might affect processing in yet
undetermined ways.
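
The difference mentioned above can be inspected directly with Encode,
which loads its EBCDIC tables on demand. This is a small sketch; the
helper name hex_octets is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: the octets Encode produces for $char in
# $encoding, as an upper-case hex string.
sub hex_octets {
    my ($encoding, $char) = @_;
    return uc(unpack('H*', Encode::encode($encoding, $char)));
}

# Compare how "!" is encoded in two EBCDIC variants.
printf "%-8s '!' => 0x%s\n", $_, hex_octets($_, '!') for qw(posix-bc cp500);
```

The same helper can be used to compare other characters, such as "<",
that matter for the detection routines.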

* bundle with test suite
* optimize some routines to give up once successful
* avoid transcoding for HTML::Parser if e.g. ISO-8859-1
* consider adding a "HTML5" mode of operation?

* http://www.w3.org/TR/REC-xml/#charencoding
* http://www.w3.org/TR/REC-xml/#sec-guessing
* http://www.w3.org/TR/xml11/#charencoding
* http://www.w3.org/TR/xml11/#sec-guessing
* http://www.w3.org/TR/html4/charset.html#h-5.2.2
* http://www.w3.org/TR/xhtml1/#C_9
* http://www.ietf.org/rfc/rfc2616.txt
* http://www.ietf.org/rfc/rfc2854.txt
* http://www.ietf.org/rfc/rfc3023.txt
* perlunicode
* Encode
* HTML::Parser

Copyright (c) 2004-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
This module is licensed under the same terms as Perl itself.

perl v5.32.0 2020-07-28 HTML::Encoding(3)