1Encode::Supported(3pm) Perl Programmers Reference Guide Encode::Supported(3pm)
2
3
4
6 Encode::Supported -- Encodings supported by Encode
7
9 Encoding Names
10 Encoding names are case insensitive. White space in names is ignored.
11 In addition, an encoding may have aliases. Each encoding has one
12 "canonical" name. The "canonical" name is chosen from the names of the
13 encoding by picking the first in the following sequence (with a few
14 exceptions).
15
16 · The name used by the Perl community. That includes 'utf8' and
17 'ascii'. Unlike aliases, canonical names directly reach the method
18 so such frequently used words like 'utf8' don't need to do alias
19 lookups.
20
21 · The MIME name as defined in IETF RFCs. This includes all "iso-"s.
22
23 · The name in the IANA registry.
24
25 · The name used by the organization that defined it.
26
Where a de jure canonical name differs from the one the Encode module
uses, it is always aliased, provided the encoding is implemented at all.
You can therefore tell whether a given encoding is implemented simply by
passing its canonical name.
31
32 Because of all the alias issues, and because in the general case
33 encodings have state, "Encode" uses an encoding object internally once
34 an operation is in progress.
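
As a rough illustration of the naming rules above, the sketch below asks Encode for the canonical name behind a few spellings. It assumes an Encode that provides resolve_alias and find_encoding; the name no-such-encoding-xyz is made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Several spellings that should all resolve to one canonical name.
my @names     = ('latin1', 'LATIN1', 'iso-8859-1', 'ISO-8859-1');
my @canonical = map { Encode::resolve_alias($_) || '(not found)' } @names;
printf "%-12s => %s\n", $names[$_], $canonical[$_] for 0 .. $#names;

# find_encoding returns the encoding object; name() gives its canonical name.
my $enc = Encode::find_encoding('latin1');
print "object name: ", $enc->name, "\n";

# A made-up (hence unimplemented) name resolves to a false value.
my $unknown = Encode::resolve_alias('no-such-encoding-xyz');
print "unknown name is ", ($unknown ? "implemented" : "not implemented"), "\n";
```

Holding on to the object returned by find_encoding avoids repeating the alias lookup on every call.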
35
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.
41
42 Encodings are categorized and implemented in several different modules
43 but you don't have to "use Encode::XX" to make them available for most
44 cases. Encode.pm will automatically load those modules on demand.
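
The on-demand loading can be observed with the encodings class method. This is only a sketch: which encodings are already loaded at startup depends on the installation, so the counts it prints will vary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encodings whose modules are loaded right now.
my @loaded_before = Encode->encodings();

# Every encoding Encode knows about, loaded or not.
my @all = Encode->encodings(':all');

# Using an encoding from Encode::JP pulls the module in automatically;
# no explicit "use Encode::JP" is needed.
my $octets = encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A
my @loaded_after = Encode->encodings();

printf "loaded before: %d, after: %d, known: %d\n",
    scalar @loaded_before, scalar @loaded_after, scalar @all;
```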
45
46 Built-in Encodings
47 The following encodings are always available.
48
Canonical     Aliases                      Comments & References
----------------------------------------------------------------
ascii         US-ascii ISO-646-US                         [ECMA]
ascii-ctrl                                      Special Encoding
iso-8859-1    latin1                                       [ISO]
null                                            Special Encoding
utf8          UTF-8                                    [RFC2279]
----------------------------------------------------------------
57
null and ascii-ctrl are special. "null" fails for all characters, so
when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references. The same goes for
"ascii-ctrl", except for control characters. For fallback modes, see
Encode.
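
A minimal sketch of these fallback modes follows. It uses "ascii" for the case whose output is easy to predict and "null" for the everything-falls-back case; LEAVE_SRC is OR-ed in explicitly so that the source string is certainly left untouched.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # "cafe" with U+00E9 as the last character

# ascii cannot represent U+00E9, so only that character falls back.
my $html = encode('ascii', $text,
    Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());
print "$html\n";    # caf&#233;

# With "null" every character is unmappable, so all of them fall back.
my $all_refs = encode('null', $text,
    Encode::FB_XMLCREF() | Encode::LEAVE_SRC());
print "$all_refs\n";
```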
62
63 Encode::Unicode -- other Unicode encodings
64 Unicode coding schemes other than native utf8 are supported by
65 Encode::Unicode, which will be autoloaded on demand.
66
----------------------------------------------------------------
UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
UCS-2LE                                                     [UC]
UTF-16                                                      [UC]
UTF-16BE                                                    [UC]
UTF-16LE                                                    [UC]
UTF-32                                                      [UC]
UTF-32BE      UCS-4                                         [UC]
UTF-32LE                                                    [UC]
UTF-7                                                  [RFC2152]
----------------------------------------------------------------
78
79 To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, see
80 Encode::Unicode.
81
UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding. It is implemented separately by Encode::Unicode::UTF7.
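
To get a quick feel for how the byte orders differ, the following sketch dumps one ASCII character under several of the schemes listed above. It is an illustration only; Encode::Unicode remains the reference.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "A";    # U+0041

# Hex dump of the same character under each coding scheme.
my %hex = map { $_ => unpack('H*', encode($_, $char)) }
    qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE);

printf "%-9s %s\n", $_, $hex{$_} for sort keys %hex;
# UTF-16BE gives 0041 and UTF-16LE gives 4100;
# plain UTF-16 prepends a byte order mark: feff0041.
```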
84
85 Encode::Byte -- Extended ASCII
86 Encode::Byte implements most single-byte encodings except for Symbols
87 and EBCDIC. The following encodings are based on single-byte encodings
88 implemented as extended ASCII. Most of them map \x80-\xff (upper half)
89 to non-ASCII characters.
90
91 ISO-8859 and corresponding vendor mappings
Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings differ slightly from those of ISO. See
<http://czyborra.com/charsets/iso8859.html> for details.
97
Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
----------------------------------------------------------------
N. America    (ASCII)         cp437        AdobeStandardEncoding
                              cp863 (DOSCanadaF)
W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                       hp-roman8
                              cp860 (DOSPortuguese)
Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                              MacCroatian
                                              MacRomanian
                                              MacRumanian
Latin3[1]     iso-8859-3
Latin4[2]     iso-8859-4
Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
  (See also next section)     cp866           MacUkrainian
Arabic        iso-8859-6      cp864   cp1256  MacArabic
                              cp1006          MacFarsi
Greek         iso-8859-7      cp737   cp1253  MacGreek
                              cp869 (DOSGreek2)
Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
Turkish       iso-8859-9      cp857   cp1254  MacTurkish
Nordics       iso-8859-10     cp865
                              cp861           MacIcelandic
                                              MacSami
Thai          iso-8859-11[3]  cp874           MacThai
(iso-8859-12 is nonexistent. Reserved for Indics?)
Baltics       iso-8859-13     cp775   cp1257
Celtics       iso-8859-14
Latin9 [4]    iso-8859-15
Latin10       iso-8859-16
Vietnamese    viscii                  cp1258  MacVietnamese
----------------------------------------------------------------
130
131 [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
132 [2] Baltics. Now on 8859-10, except for Latvian.
133 [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
134 [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
135 letters that are missing from 8859-1 were added.
136
137 All cp* are also available as ibm-*, ms-*, and windows-* . See also
138 <http://czyborra.com/charsets/codepages.html>.
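
The practical effect of a "slightly different" vendor mapping shows up in the upper half of the byte range. The sketch below decodes the same byte under ISO-8859-1 and cp1252, and also resolves two vendor-prefixed names; it is an illustration, not a full comparison of the tables.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same byte decodes differently under ISO-8859-1 and its Windows cousin.
my $byte = "\x80";
my $iso  = sprintf 'U+%04X', ord(decode('iso-8859-1', $byte));    # U+0080
my $win  = sprintf 'U+%04X', ord(decode('cp1252',     $byte));    # U+20AC
print "iso-8859-1: $iso, cp1252: $win\n";

# ISO-8859-15 (Latin9) puts the Euro sign at 0xA4 instead.
my $euro_byte = "\xA4";
my $latin9    = sprintf 'U+%04X', ord(decode('iso-8859-15', $euro_byte));
print "iso-8859-15 0xA4: $latin9\n";

# Vendor prefixes are aliases for the cp* canonical names.
my @aliases = map { scalar Encode::resolve_alias($_) } qw(windows-1252 ibm-850);
print "@aliases\n";
```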
139
140 Macintosh encodings don't seem to be registered in such entities as
141 IANA. "Canonical" names in Encode are based upon Apple's Tech Note
142 1150. See <http://developer.apple.com/technotes/tn/tn1150.html> for
143 details.
144
145 KOI8 - De Facto Standard for the Cyrillic world
Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net. Encode comes with the following KOI charsets.
For gory details, see <http://czyborra.com/charsets/cyrillic.html>.
149
----------------------------------------------------------------
koi8-f
koi8-r cp878                                           [RFC1489]
koi8-u                                                 [RFC2319]
----------------------------------------------------------------
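
As a small illustration of why the choice of Cyrillic encoding matters, this sketch encodes one letter under KOI8-R and two of the encodings from the previous table. Each yields a different byte.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cyr_a = "\x{0410}";    # CYRILLIC CAPITAL LETTER A

# One character, three single-byte Cyrillic encodings, three different bytes.
my %byte = map { $_ => sprintf('0x%02X', ord(encode($_, $cyr_a))) }
    qw(koi8-r iso-8859-5 cp1251);
print "$_: $byte{$_}\n" for sort keys %byte;
```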
155
156 gsm0338 - Hentai Latin 1
157 GSM0338 is for GSM handsets. Though it shares alphanumerals with ASCII,
158 control character ranges and other parts are mapped very differently,
159 mainly to store Greek characters. There are also escape sequences
160 (starting with 0x1B) to cover e.g. the Euro sign.
161
This was once handled by Encode::Byte but because of all those unusual
specifications, Encode 2.20 has relocated the support to
Encode::GSM0338. See Encode::GSM0338 for details.
165
166 gsm0338 support before 2.19
167 Some special cases like a trailing 0x00 byte or a lone 0x1B byte are
168 not well-defined and decode() will return an empty string for them.
169 One possible workaround is
170
    use Encode qw(decode);

    $gsm =~ s/\x00\z/\x00\x00/;            # double a trailing 0x00 byte
    $uni = decode("gsm0338", $gsm);
    $uni .= "\xA0" if $gsm =~ /\x1B\z/;    # lone trailing 0x1B byte
174
Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).
179
180 The GSM0338 is also covered in Encode::Byte even though it is not an
181 "extended ASCII" encoding.
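
The differences from ASCII described above can be seen in a few bytes. The sketch below assumes an Encode recent enough to ship gsm0338 (2.20 or later puts it in Encode::GSM0338).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "@" sits at 0x00 in GSM 03.38, not at 0x40 as in ASCII.
my $at = unpack 'H*', encode('gsm0338', '@');

# The Euro sign needs the 0x1B escape followed by 0x65.
my $euro = unpack 'H*', encode('gsm0338', "\x{20AC}");

# Latin capital Z is not reinterpreted as Greek Zeta.
my $z_byte = "\x5A";
my $z      = sprintf 'U+%04X', ord(decode('gsm0338', $z_byte));

print "at=$at euro=$euro z=$z\n";
```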
182
183 CJK: Chinese, Japanese, Korean (Multibyte)
184 Note that Vietnamese is listed above. Also read "Encoding vs Charset"
185 below. Also note that these are implemented in distinct modules by
186 countries, due to the size concerns (simplified Chinese is mapped to
187 'CN', continental China, while traditional Chinese is mapped to 'TW',
188 Taiwan). Please refer to their respective documentation pages.
189
190 Encode::CN -- Continental China
Standard      DOS/Win Macintosh                Comment/Reference
----------------------------------------------------------------
euc-cn [1]            MacChineseSimp
(gbk)         cp936 [2]
gb12345-raw                              { GB12345 without CES }
gb2312-raw                                { GB2312 without CES }
hz
iso-ir-165
----------------------------------------------------------------

[1] GB2312 is aliased to this. See "Microsoft-related naming mess".
[2] gbk is aliased to this. See "Microsoft-related naming mess".
203
204 Encode::JP -- Japan
Standard      DOS/Win Macintosh                Comment/Reference
----------------------------------------------------------------
euc-jp
shiftjis      cp932   macJapanese
7bit-jis
iso-2022-jp                                            [RFC1468]
iso-2022-jp-1                                          [RFC2237]
jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
----------------------------------------------------------------
216
217 Encode::KR -- Korea
Standard      DOS/Win Macintosh                Comment/Reference
----------------------------------------------------------------
euc-kr                MacKorean                        [RFC1557]
              cp949 [1]
iso-2022-kr                                            [RFC1557]
johab                                  [KS X 1001:1998, Annex 3]
ksc5601-raw                              { KSC5601 without CES }
----------------------------------------------------------------

[1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
    See below.
229
230 Encode::TW -- Taiwan
Standard      DOS/Win Macintosh                Comment/Reference
----------------------------------------------------------------
big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
big5-hkscs
----------------------------------------------------------------
236
237 Encode::HanExtra -- More Chinese via CPAN
238 Due to the size concerns, additional Chinese encodings below are
239 distributed separately on CPAN, under the name Encode::HanExtra.
240
Standard      DOS/Win Macintosh                Comment/Reference
----------------------------------------------------------------
big5ext                                   CMEX's Big5e Extension
big5plus                                  CMEX's Big5+ Extension
cccii         Chinese Character Code for Information Interchange
euc-tw                             EUC (Extended Unix Character)
gb18030                          GBK with Traditional Characters
----------------------------------------------------------------
249
250 Encode::JIS2K -- JIS X 0213 encodings via CPAN
251 Due to size concerns, additional Japanese encodings below are
252 distributed separately on CPAN, under the name Encode::JIS2K.
253
Standard      DOS/Win Macintosh                Comment/Reference
----------------------------------------------------------------
euc-jisx0213
shiftjisx0213
iso-2022-jp-3
jis0213-1-raw
jis0213-2-raw
----------------------------------------------------------------
262
263 Miscellaneous encodings
264 Encode::EBCDIC
265 See perlebcdic for details.
266
267 ----------------------------------------------------------------
268 cp37
269 cp500
270 cp875
271 cp1026
272 cp1047
273 posix-bc
274 ----------------------------------------------------------------
275
276 Encode::Symbols
277 For symbols and dingbats.
278
279 ----------------------------------------------------------------
280 symbol
281 dingbats
282 MacDingbats
283 AdobeZdingbat
284 AdobeSymbol
285 ----------------------------------------------------------------
286
287 Encode::MIME::Header
Strictly speaking, the MIME header encoding documented in RFC 2047 is
more an encapsulation than an encoding. However, supporting it is
imperative in the modern world, so it is supported.
291
292 ----------------------------------------------------------------
293 MIME-Header [RFC2047]
294 MIME-B [RFC2047]
295 MIME-Q [RFC2047]
296 ----------------------------------------------------------------
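
A short sketch of decoding RFC 2047 encoded words follows. Only decoding is shown, since the exact output of encoding (folding, choice of B or Q) may differ between Encode versions. The two header values are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A "B" (base64) encoded word carrying the UTF-8 octets of U+3042.
my $b_header = '=?UTF-8?B?44GC?=';
my $b_word   = decode('MIME-Header', $b_header);

# A "Q" encoded word carrying ISO-8859-1 octets; =E9 is U+00E9.
my $q_header = '=?ISO-8859-1?Q?caf=E9?=';
my $q_word   = decode('MIME-Header', $q_header);

printf "B: U+%04X, Q ends with U+%04X\n",
    ord($b_word), ord(substr($q_word, -1));
```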
297
298 Encode::Guess
This is not the name of an encoding but a utility that lets you pick
the most appropriate encoding for a piece of data out of given
suspects. See Encode::Guess for details.
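
A minimal sketch of guessing among suspects follows. The sample data is generated inside the script so the right answer is known; with real data and several similar suspects, the result may well be an error string naming more than one candidate.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding

# Octets whose encoding we pretend not to know (really EUC-JP).
my $octets = encode('euc-jp', "\x{3042}\x{3044}\x{3046}");

# Offer euc-jp as a suspect in addition to the defaults.
my $guess = guess_encoding($octets, 'euc-jp');

# On success an encoding object comes back; otherwise an error string.
my $result = ref($guess) ? $guess->name : "could not decide: $guess";
print "$result\n";
```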
302
304 The following encodings are not supported as yet; some because they are
305 rarely used, some because of technical difficulties. They may be
306 supported by external modules via CPAN in the future, however.
307
308 ISO-2022-JP-2 [RFC1554]
Not very popular yet. Needs the Unicode Database or an equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, and their code points in Unicode overlap. So
you need to look up the database to determine which character set a
given Unicode character should belong to.
314
315 ISO-2022-CN [RFC1922]
316 Not very popular. Needs CNS 11643-1 and -2 which are not available
317 in this module. CNS 11643 is supported (via euc-tw) in
318 Encode::HanExtra. Autrijus Tang may add support for this encoding in
319 his module in future.
320
321 Various HP-UX encodings
322 The following are unsupported due to the lack of mapping data.
323
324 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
325 '15' - japanese15, korean15, and roi15
326
327 Cyrillic encoding ISO-IR-111
328 Anton Tagunov doubts its usefulness.
329
330 ISO-8859-8-1 [Hebrew]
None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
and MacHebrew are supported only because mappings were available at
<http://www.unicode.org/>). Contributions welcome.
334
335 ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
336 Ditto.
337
338 Thai encoding TCVN
339 Ditto.
340
341 Vietnamese encodings VPS
342 Though Jungshik Shin has reported that Mozilla supports this
343 encoding, it was too late before 5.8.0 for us to add it. In the
344 future, it may be available via a separate module. See
345 <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
346 and
347 <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
348 if you are interested in helping us.
349
350 Various Mac encodings
351 The following are unsupported due to the lack of mapping data.
352
353 MacArmenian, MacBengali, MacBurmese, MacEthiopic
354 MacExtArabic, MacGeorgian, MacKannada, MacKhmer
355 MacLaotian, MacMalayalam, MacMongolian, MacOriya
356 MacSinhalese, MacTamil, MacTelugu, MacTibetan
357 MacVietnamese
358
359 The rest which are already available are based upon the vendor
360 mappings at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
361
362 (Mac) Indic encodings
The maps for the following are available at <http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which enc2xs currently does not support:
366
367 MacDevanagari
368 MacGurmukhi
369 MacGujarati
370
371 For details, please see "Unicode mapping issues and notes:" at
372 <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
373
374 I believe this issue is prevalent not only for Mac Indics but also in
375 other Indic encodings, but the above were the only Indic encodings
376 maps that I could find at <http://www.unicode.org/> .
377
Encoding vs Charset
379 We are used to using the term (character) encoding and character set
380 interchangeably. But just as confusing the terms byte and character is
381 dangerous and the terms should be differentiated when needed, we need
382 to differentiate encoding and character set.
383
384 To understand that, here is a description of how we make computers grok
385 our characters.
386
387 · First we start with which characters to include. We call this
388 collection of characters character repertoire.
389
390 · Then we have to give each character a unique ID so your computer can
391 tell the difference between 'a' and 'A'. This itemized character
392 repertoire is now a character set.
393
· If your computer can grok the character set without further
  processing, you can go ahead and use it. This is called a coded
  character set (CCS) or raw character encoding. ASCII is used this
  way for most cases.
398
399 · But in many cases, especially multi-byte CJK encodings, you have to
400 tweak a little more. Your network connection may not accept any data
401 with the Most Significant Bit set, and your computer may not be able
402 to tell if a given byte is a whole character or just half of it. So
403 you have to encode the character set to use it.
404
405 A character encoding scheme (CES) determines how to encode a given
406 character set, or a set of multiple character sets. 7bit ISO-2022 is
407 an example of a CES. You switch between character sets via escape
408 sequences.
409
410 Technically, or mathematically, speaking, a character set encoded in
411 such a CES that maps character by character may form a CCS. EUC is
412 such an example. The CES of EUC is as follows:
413
414 · Map ASCII unchanged.
415
· Map a character set that consists of 94 or 96 to the power of N
  members by adding 0x80 to each byte.
418
419 · You can also use 0x8e and 0x8f to indicate that the following
420 sequence of characters belongs to yet another character set. To each
421 following byte is added the value 0x80.
422
By carefully looking at the encoded byte sequence, you can find that
each byte sequence corresponds to a unique number. In that sense, EUC
is a CCS generated by the CES above from up to four CCS (complicated?).
UTF-8 falls into this category. See "UTF-8" in perlunicode to find out
how UTF-8 maps Unicode to a byte sequence.
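
The "add 0x80 to each byte" rule can be checked by hand. The sketch below takes a character from JIS X 0208, obtains its raw code via jis0208-raw, applies the rule manually, and compares the result with what euc-jp produces. It covers only the plain two-byte case, not the 0x8e and 0x8f prefixes.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{3042}";    # HIRAGANA LETTER A, a JIS X 0208 character

# The bare coded character set, without any encoding scheme applied.
my $raw = encode('jis0208-raw', $char);

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# Compare with what the euc-jp encoding itself produces.
my $euc = encode('euc-jp', $char);
printf "raw=%s by_hand=%s euc-jp=%s\n",
    map { unpack('H*', $_) } $raw, $by_hand, $euc;
```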
428
You may also have found out by now why 7bit ISO-2022 cannot comprise a
CCS. If you look at the byte sequence \x21\x21, you can't tell if it is
two !'s or an IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you
have no trouble differentiating between "!!" and an IDEOGRAPHIC SPACE.
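
The ambiguity is easy to reproduce. In this sketch the same two bytes are read once as ASCII and once as raw JIS X 0208, and IDEOGRAPHIC SPACE is then encoded with iso-2022-jp (a 7bit ISO-2022 encoding) and with euc-jp.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes = "\x21\x21";

# Read as ASCII, the two bytes are two exclamation marks.
my $as_ascii = decode('ascii', $bytes);
# Read as raw JIS X 0208, the same two bytes are one IDEOGRAPHIC SPACE.
my $as_jis = decode('jis0208-raw', $bytes);
printf "ascii: %d char(s), jis0208-raw: U+%04X\n",
    length($as_ascii), ord($as_jis);

# 7bit ISO-2022 needs escape sequences to say which reading applies...
my $iso2022 = unpack 'H*', encode('iso-2022-jp', "\x{3000}");
# ...while EUC moves the character to bytes that ASCII never uses.
my $euc = unpack 'H*', encode('euc-jp', "\x{3000}");
print "iso-2022-jp: $iso2022, euc-jp: $euc\n";
```

In the iso-2022-jp output the bytes 21 21 still appear, wrapped between the escape sequences 1b 24 42 and 1b 28 42.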
433
435 This section tries to classify the supported encodings by their
436 applicability for information exchange over the Internet and to choose
437 the most suitable aliases to name them in the context of such
438 communication.
439
440 · To (en|de)code encodings marked by "(**)", you need
441 "Encode::HanExtra", available from CPAN.
442
443 Encoding names
444
445 US-ASCII UTF-8 ISO-8859-* KOI8-R
446 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
447 EUC-KR Big5 GB2312
448
449 are registered with IANA as preferred MIME names and may be used over
450 the Internet.
451
452 "Shift_JIS" has been officialized by JIS X 0208:1997. "Microsoft-
453 related naming mess" gives details.
454
455 "GB2312" is the IANA name for "EUC-CN". See "Microsoft-related naming
456 mess" for details.
457
458 "GB_2312-80" raw encoding is available as "gb2312-raw" with Encode. See
459 Encode::CN for details.
460
461 EUC-CN
462 KOI8-U [RFC2319]
463
464 have not been registered with IANA (as of March 2002) but seem to be
465 supported by major web browsers. The IANA name for "EUC-CN" is
466 "GB2312".
467
468 KS_C_5601-1987
469
470 is heavily misused. See "Microsoft-related naming mess" for details.
471
"KS_C_5601-1987" raw encoding is available as "ksc5601-raw" with
Encode. See Encode::KR for details.
474
475 UTF-16 UTF-16BE UTF-16LE
476
477 are IANA-registered "charset"s. See [RFC 2781] for details. Jungshik
478 Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and
479 NS 4/6. Beware however that
480
· "UTF-16" support in any software you're going to be
  using/interoperating with has probably been less tested than "UTF-8"
  support
484
485 · "UTF-8" coded data seamlessly passes traditional command piping
486 ("cat", "more", etc.) while "UTF-16" coded data is likely to cause
487 confusion (with its zero bytes, for example)
488
489 · it is beyond the power of words to describe the way HTML browsers
490 encode non-"ASCII" form data. To get a general impression, visit
491 <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>. While
492 encoding of form data has stabilized for "UTF-8" encoded pages (at
493 least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
494 expect fun (and cross-browser discrepancies) with "UTF-16" encoded
495 pages!
496
497 The rule of thumb is to use "UTF-8" unless you know what you're doing
498 and unless you really benefit from using "UTF-16".
499
ISO-IR-165 [RFC1345]
VISCII
GB 12345
GB 18030 (**) (see links below)
EUC-TW (**)
505
506 are totally valid encodings but not registered at IANA. The names
507 under which they are listed here are probably the most widely-known
508 names for these encodings and are recommended names.
509
510 BIG5PLUS (**)
511
512 is a proprietary name.
513
514 Microsoft-related naming mess
515 Microsoft products misuse the following names:
516
517 KS_C_5601-1987
518 Microsoft extension to "EUC-KR".
519
520 Proper names: "CP949", "UHC", "x-windows-949" (as used by Mozilla).
521
522 See
523 <http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
524 for details.
525
Encode aliases "KS_C_5601-1987" to "cp949" to reflect this common
misusage. Raw "KS_C_5601-1987" encoding is available as
"ksc5601-raw".
529
530 See Encode::KR for details.
531
532 GB2312
533 Microsoft extension to "EUC-CN".
534
535 Proper names: "CP936", "GBK".
536
537 "GB2312" has been registered in the "EUC-CN" meaning at IANA. This
538 has partially repaired the situation: Microsoft's "GB2312" has become
539 a superset of the official "GB2312".
540
541 Encode aliases "GB2312" to "euc-cn" in full agreement with IANA
542 registration. "cp936" is supported separately. Raw "GB_2312-80"
543 encoding is available as "gb2312-raw".
544
545 See Encode::CN for details.
546
547 Big5
548 Microsoft extension to "Big5".
549
550 Proper name: "CP950".
551
552 Encode separately supports "Big5" and "cp950".
553
554 Shift_JIS
555 Microsoft's understanding of "Shift_JIS".
556
557 JIS has not endorsed the full Microsoft standard however. The
558 official "Shift_JIS" includes only JIS X 0201 and JIS X 0208
559 character sets, while Microsoft has always used "Shift_JIS" to encode
560 a wider character repertoire. See "IANA" registration for
561 "Windows-31J".
562
563 As a historical predecessor, Microsoft's variant probably has more
564 rights for the name, though it may be objected that Microsoft
565 shouldn't have used JIS as part of the name in the first place.
566
567 Unambiguous name: "CP932". "IANA" name (also used by Mozilla, and
568 provided as an alias by Encode): "Windows-31J".
569
570 Encode separately supports "Shift_JIS" and "cp932".
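
The aliasing decisions described in this section, and the difference in repertoire between "Shift_JIS" and "cp932", can be observed with a sketch like the following. U+2460 is used as an example of a character that Microsoft's variant adds.

```perl
use strict;
use warnings;
use Encode qw(encode);

# How Encode resolves the names discussed above.
my %canonical = map { $_ => scalar Encode::resolve_alias($_) }
    qw(KS_C_5601-1987 GB2312 Windows-31J);
print "$_ => $canonical{$_}\n" for sort keys %canonical;

# U+2460 CIRCLED DIGIT ONE is in Microsoft's repertoire but not in
# JIS X 0201 or JIS X 0208, so only cp932 can encode it.
my $circled = "\x{2460}";
my $cp932   = unpack 'H*', encode('cp932', $circled);
my $strict  = eval {
    encode('shiftjis', $circled,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "cp932: $cp932, shiftjis: ",
    (defined $strict ? "encoded" : "rejected"), "\n";
```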
571
573 character repertoire
574 A collection of unique characters. A character set in the strictest
575 sense. At this stage, characters are not numbered.
576
577 coded character set (CCS)
578 A character set that is mapped in a way computers can use directly.
579 Many character encodings, including EUC, fall in this category.
580
581 character encoding scheme (CES)
An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and CES.
586
587 charset (in MIME context)
588 has long been used in the meaning of "encoding", CES.
589
590 While the word combination "character set" has lost this meaning in
591 MIME context since [RFC 2130], the "charset" abbreviation has
592 retained it. This is how [RFC 2277] and [RFC 2278] bless "charset":
593
594 This document uses the term "charset" to mean a set of rules for
595 mapping from a sequence of octets to a sequence of characters, such
596 as the combination of a coded character set and a character encoding
597 scheme; this is also what is used as an identifier in MIME "charset="
598 parameters, and registered in the IANA charset registry ... (Note
599 that this is NOT a term used by other standards bodies, such as ISO).
600 [RFC 2277]
601
602 EUC
603 Extended Unix Character. See ISO-2022.
604
605 ISO-2022
606 A CES that was carefully designed to coexist with ASCII. There are a
607 7 bit version and an 8 bit version.
608
609 The 7 bit version switches character set via escape sequence so it
610 cannot form a CCS. Since this is more difficult to handle in
611 programs than the 8 bit version, the 7 bit version is not very
612 popular except for iso-2022-jp, the de facto standard CES for
613 e-mails.
614
615 The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
616 thereof. Pre-5.6 perl could use them as string literals.
617
618 UCS
619 Short for Universal Character Set. When you say just UCS, it means
620 Unicode.
621
622 UCS-2
623 ISO/IEC 10646 encoding form: Universal Character Set coded in two
624 octets.
625
626 Unicode
627 A character set that aims to include all character repertoires of the
628 world. Many character sets in various national as well as industrial
629 standards have become, in a way, just subsets of Unicode.
630
631 UTF
632 Short for Unicode Transformation Format. Determines how to map a
633 Unicode character into a byte sequence.
634
635 UTF-16
636 A UTF in 16-bit encoding. Can either be in big endian or little
637 endian. The big endian version is called UTF-16BE (equal to UCS-2 +
638 surrogate support) and the little endian version is called UTF-16LE.
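
The "UCS-2 + surrogate support" remark can be illustrated with a character outside the 16-bit range. This is a sketch: the strict UCS-2BE encode is expected to fail, and the eval only records whether it did.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $astral = "\x{1F600}";    # a character outside the 16-bit range

# UTF-16BE represents it as a surrogate pair.
my $utf16 = unpack 'H*', encode('UTF-16BE', $astral);    # d83dde00

# UCS-2BE has no surrogates, so a strict encode is expected to fail.
my $ucs2 = eval {
    encode('UCS-2BE', $astral,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
print "UTF-16BE: $utf16, UCS-2BE: ",
    (defined $ucs2 ? "encoded" : "rejected"), "\n";
```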
639
Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW,
Encode::EBCDIC, Encode::Symbol, Encode::MIME::Header, Encode::Guess
643
645 ECMA
646 European Computer Manufacturers Association <http://www.ecma.ch>
647
648 ECMA-035 (eq "ISO-2022")
649 <http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
650
651 The specification of ISO-2022 is available from the link above.
652
653 IANA
654 Internet Assigned Numbers Authority <http://www.iana.org/>
655
656 Assigned Charset Names by IANA
657 <http://www.iana.org/assignments/character-sets>
658
659 Most of the "canonical names" in Encode derive from this list so
660 you can directly apply the string you have extracted from MIME
661 header of mails and web pages.
662
663 ISO
664 International Organization for Standardization <http://www.iso.ch/>
665
666 RFC
667 Request For Comments -- need I say more?
668 <http://www.rfc-editor.org/>, <http://www.rfc.net/>,
669 <http://www.faqs.org/rfcs/>
670
671 UC
672 Unicode Consortium <http://www.unicode.org/>
673
674 Unicode Glossary
675 <http://www.unicode.org/glossary/>
676
677 The glossary of this document is based upon this site.
678
679 Other Notable Sites
680 czyborra.com
681 <http://czyborra.com/>
682
683 Contains a lot of useful information, especially gory details of ISO
684 vs. vendor mappings.
685
686 CJK.inf
687 <http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
688
689 Somewhat obsolete (last update in 1996), but still useful. Also try
690
691 <ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
692
693 You will find brief info on "EUC-CN", "GBK" and mostly on "GB 18030".
694
695 Jungshik Shin's Hangul FAQ
696 <http://jshin.net/faq>
697
698 And especially its subject 8.
699
700 <http://jshin.net/faq/qa8.html>
701
702 A comprehensive overview of the Korean ("KS *") standards.
703
704 debian.org: "Introduction to i18n"
705 A brief description for most of the mentioned CJK encodings is
706 contained in
707 <http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
708
709 Offline sources
710 "CJKV Information Processing" by Ken Lunde
711 CJKV Information Processing 1999 O'Reilly & Associates, ISBN :
712 1-56592-224-7
713
714 The modern successor of "CJK.inf".
715
716 Features a comprehensive coverage of CJKV character sets and
717 encodings along with many other issues faced by anyone trying to
718 better support CJKV languages/scripts in all the areas of information
719 processing.
720
721 To purchase this book, visit
722 <http://www.oreilly.com/catalog/cjkvinfo/> or your favourite
723 bookstore.
724
725
726
727perl v5.10.1 2009-02-12 Encode::Supported(3pm)