1Encode::Supported(3pm) Perl Programmers Reference Guide Encode::Supported(3pm)
2
3
4
6 Encode::Supported -- Encodings supported by Encode
7
9 Encoding Names
10 Encoding names are case insensitive. White space in names is ignored.
11 In addition, an encoding may have aliases. Each encoding has one
12 "canonical" name. The "canonical" name is chosen from the names of the
13 encoding by picking the first in the following sequence (with a few
14 exceptions).
15
· The name used by the Perl community. That includes 'utf8' and
'ascii'. Unlike aliases, canonical names reach the encoding method
directly, so frequently used names such as 'utf8' need no alias
lookup.
20
21 · The MIME name as defined in IETF RFCs. This includes all "iso-"s.
22
23 · The name in the IANA registry.
24
25 · The name used by the organization that defined it.
26
Where a de jure canonical name differs from the one used by the Encode
module, it is always provided as an alias if the encoding is implemented
at all. So you can safely tell whether a given encoding is implemented
just by passing its canonical name.
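
A minimal sketch of that check, assuming only the find_encoding() and
resolve_alias() functions exported by Encode. The variable names and the
deliberately bogus name 'no-such-encoding' are illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# find_encoding() returns an encoding object, or undef for an unknown name.
my $latin1    = find_encoding('Latin1');    # an alias, matched case-insensitively
my $canonical = $latin1->name;              # canonical name of that encoding

# resolve_alias() returns the canonical name, or a false value if unsupported.
my $ascii   = resolve_alias('US-ASCII');
my $missing = find_encoding('no-such-encoding');

print "Latin1   => $canonical\n";
print "US-ASCII => $ascii\n";
print "no-such-encoding is ",
    ( defined $missing ? "supported" : "not supported" ), "\n";
```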
31
32 Because of all the alias issues, and because in the general case
33 encodings have state, "Encode" uses an encoding object internally once
34 an operation is in progress.
35
As of Perl 5.8.0, at least the following encodings are recognized.
Note that unless otherwise specified, they are all case insensitive
(via alias) and all occurrences of spaces are replaced with '-'. In
other words, "ISO 8859 1" and "iso-8859-1" are identical.
41
42 Encodings are categorized and implemented in several different modules
43 but you don't have to "use Encode::XX" to make them available for most
44 cases. Encode.pm will automatically load those modules on demand.
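
As a rough illustration of on-demand loading, the sketch below compares
the encodings that are loaded with those Encode can load, then uses one
from an external module without a "use" line for it. It assumes the
Encode->encodings() class method; koi8-r is picked only as an example.

```perl
use strict;
use warnings;
use Encode;

# Canonical names of the encodings that are loaded right now.
my @loaded = Encode->encodings();

# Canonical names of every encoding Encode knows how to load.
my @all = Encode->encodings(':all');

# koi8-r lives in Encode::Byte, which is pulled in on first use.
my $bytes = Encode::encode( 'koi8-r', "\x{0410}" );  # CYRILLIC CAPITAL LETTER A

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;
```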
45
46 Built-in Encodings
47 The following encodings are always available.
48
49 Canonical Aliases Comments & References
50 ----------------------------------------------------------------
51 ascii US-ascii ISO-646-US [ECMA]
52 ascii-ctrl Special Encoding
53 iso-8859-1 latin1 [ISO]
54 null Special Encoding
55 utf8 UTF-8 [RFC2279]
56 ----------------------------------------------------------------
57
null and ascii-ctrl are special. "null" fails for all characters, so
when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
CHARACTERS will fall back to character references. Ditto for
"ascii-ctrl" except for control characters. For fallback modes, see Encode.
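
The fallback modes can be sketched as follows. The example uses "ascii"
to show a single unmappable character falling back, then "null", where
every character is expected to fall back. The sample string is an
assumption for illustration; LEAVE_SRC is added so that encode() does
not modify the source variable.

```perl
use strict;
use warnings;
use Encode ();

my $text = "caf\x{e9}";    # U+00E9 has no mapping in ascii

# LEAVE_SRC keeps encode() from modifying $text in place.
my $xml = Encode::encode( 'ascii', $text,
    Encode::FB_XMLCREF() | Encode::LEAVE_SRC() );
my $html = Encode::encode( 'ascii', $text,
    Encode::FB_HTMLCREF() | Encode::LEAVE_SRC() );

# With "null" no character is expected to map, so all of them fall back.
my $all_refs = Encode::encode( 'null', $text,
    Encode::FB_XMLCREF() | Encode::LEAVE_SRC() );

print "$xml\n$html\n$all_refs\n";
```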
62
63 Encode::Unicode -- other Unicode encodings
64 Unicode coding schemes other than native utf8 are supported by
65 Encode::Unicode, which will be autoloaded on demand.
66
67 ----------------------------------------------------------------
68 UCS-2BE UCS-2, iso-10646-1 [IANA, UC]
69 UCS-2LE [UC]
70 UTF-16 [UC]
71 UTF-16BE [UC]
72 UTF-16LE [UC]
73 UTF-32 [UC]
74 UTF-32BE UCS-4 [UC]
75 UTF-32LE [UC]
76 UTF-7 [RFC2152]
77 ----------------------------------------------------------------
78
79 To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, see
80 Encode::Unicode.
81
UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
encoding. It is implemented separately by Encode::Unicode::UTF7.
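
To make the BE/LE distinction concrete, here is a small sketch that
encodes one character in several of the schemes listed above and prints
the resulting bytes in hex. The helper hex_bytes() is hypothetical and
exists only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "A";    # U+0041

# Render an octet string as space-separated hex bytes.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my %form = map { ( $_ => hex_bytes( encode( $_, $char ) ) ) }
    qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE);

printf "%-8s %s\n", $_, $form{$_} for sort keys %form;
```

The BE and LE forms differ only in byte order, while plain UTF-16 is
written with a leading byte order mark.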
84
85 Encode::Byte -- Extended ASCII
86 Encode::Byte implements most single-byte encodings except for Symbols
87 and EBCDIC. The following encodings are based on single-byte encodings
88 implemented as extended ASCII. Most of them map \x80-\xff (upper half)
89 to non-ASCII characters.
90
91 ISO-8859 and corresponding vendor mappings
Since there are so many, they are presented in table format with
languages and corresponding encoding names by vendors. Note that the
table is sorted in order of ISO-8859 and that the corresponding vendor
mappings are slightly different from those of ISO. See
<http://czyborra.com/charsets/iso8859.html> for details.
97
98 Lang/Regions ISO/Other Std. DOS Windows Macintosh Others
99 ----------------------------------------------------------------
100 N. America (ASCII) cp437 AdobeStandardEncoding
101 cp863 (DOSCanadaF)
102 W. Europe iso-8859-1 cp850 cp1252 MacRoman nextstep
103 hp-roman8
104 cp860 (DOSPortuguese)
105 Cntrl. Europe iso-8859-2 cp852 cp1250 MacCentralEurRoman
106 MacCroatian
107 MacRomanian
108 MacRumanian
109 Latin3[1] iso-8859-3
110 Latin4[2] iso-8859-4
111 Cyrillics iso-8859-5 cp855 cp1251 MacCyrillic
112 (See also next section) cp866 MacUkrainian
113 Arabic iso-8859-6 cp864 cp1256 MacArabic
114 cp1006 MacFarsi
115 Greek iso-8859-7 cp737 cp1253 MacGreek
116 cp869 (DOSGreek2)
117 Hebrew iso-8859-8 cp862 cp1255 MacHebrew
118 Turkish iso-8859-9 cp857 cp1254 MacTurkish
119 Nordics iso-8859-10 cp865
120 cp861 MacIcelandic
121 MacSami
122 Thai iso-8859-11[3] cp874 MacThai
123 (iso-8859-12 is nonexistent. Reserved for Indics?)
124 Baltics iso-8859-13 cp775 cp1257
125 Celtics iso-8859-14
126 Latin9 [4] iso-8859-15
127 Latin10 iso-8859-16
128 Vietnamese viscii cp1258 MacVietnamese
129 ----------------------------------------------------------------
130
131 [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
132 [2] Baltics. Now on 8859-10, except for Latvian.
133 [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
134 [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
135 letters that are missing from 8859-1 were added.
136
137 All cp* are also available as ibm-*, ms-*, and windows-* . See also
138 <http://czyborra.com/charsets/codepages.html>.
139
140 Macintosh encodings don't seem to be registered in such entities as
141 IANA. "Canonical" names in Encode are based upon Apple's Tech Note
142 1150. See <http://developer.apple.com/technotes/tn/tn1150.html> for
143 details.
144
145 KOI8 - De Facto Standard for the Cyrillic world
146 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
147 popular in the Net. Encode comes with the following KOI charsets.
148 For gory details, see <http://czyborra.com/charsets/cyrillic.html>
149
150 ----------------------------------------------------------------
151 koi8-f
152 koi8-r cp878 [RFC1489]
153 koi8-u [RFC2319]
154 ----------------------------------------------------------------
155
156 gsm0338 - Hentai Latin 1
157 GSM0338 is for GSM handsets. Though it shares alphanumerals with ASCII,
158 control character ranges and other parts are mapped very differently,
159 mainly to store Greek characters. There are also escape sequences
160 (starting with 0x1B) to cover e.g. the Euro sign.
161
This was once handled by Encode::Byte but because of all those unusual
specifications, Encode 2.20 has relocated the support to
Encode::GSM0338. See Encode::GSM0338 for details.
165
166 gsm0338 support before 2.19
167 Some special cases like a trailing 0x00 byte or a lone 0x1B byte are
168 not well-defined and decode() will return an empty string for them.
169 One possible workaround is
170
171 $gsm =~ s/\x00\z/\x00\x00/;
172 $uni = decode("gsm0338", $gsm);
173 $uni .= "\xA0" if $gsm =~ /\x1B\z/;
174
Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters (for example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA)).
179
180 The GSM0338 is also covered in Encode::Byte even though it is not an
181 "extended ASCII" encoding.
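
A short sketch of the two points made above: the Euro sign needs the
0x1B escape, and Greek letters occupy positions that ASCII reserves for
control characters. It assumes an Encode recent enough to ship
Encode::GSM0338; the two sample characters are chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The Euro sign is reached through an escape sequence starting with 0x1B.
my $euro = encode( 'gsm0338', "\x{20AC}" );

# GREEK CAPITAL LETTER DELTA sits in the range ASCII uses for controls.
my $delta = encode( 'gsm0338', "\x{0394}" );

printf "Euro:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $euro;
printf "Delta: %s\n", join ' ', map { sprintf '%02X', ord } split //, $delta;
```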
182
183 CJK: Chinese, Japanese, Korean (Multibyte)
184 Note that Vietnamese is listed above. Also read "Encoding vs Charset"
185 below. Also note that these are implemented in distinct modules by
186 countries, due to the size concerns (simplified Chinese is mapped to
187 'CN', continental China, while traditional Chinese is mapped to 'TW',
188 Taiwan). Please refer to their respective documentation pages.
189
190 Encode::CN -- Continental China
191 Standard DOS/Win Macintosh Comment/Reference
192 ----------------------------------------------------------------
193 euc-cn [1] MacChineseSimp
194 (gbk) cp936 [2]
195 gb12345-raw { GB12345 without CES }
196 gb2312-raw { GB2312 without CES }
197 hz
198 iso-ir-165
199 ----------------------------------------------------------------
200
[1] GB2312 is aliased to this. See "Microsoft-related naming mess"
[2] gbk is aliased to this. See "Microsoft-related naming mess"
203
204 Encode::JP -- Japan
205 Standard DOS/Win Macintosh Comment/Reference
206 ----------------------------------------------------------------
207 euc-jp
208 shiftjis cp932 macJapanese
209 7bit-jis
210 iso-2022-jp [RFC1468]
211 iso-2022-jp-1 [RFC2237]
212 jis0201-raw { JIS X 0201 (roman + halfwidth kana) without CES }
213 jis0208-raw { JIS X 0208 (Kanji + fullwidth kana) without CES }
214 jis0212-raw { JIS X 0212 (Extended Kanji) without CES }
215 ----------------------------------------------------------------
216
217 Encode::KR -- Korea
218 Standard DOS/Win Macintosh Comment/Reference
219 ----------------------------------------------------------------
220 euc-kr MacKorean [RFC1557]
221 cp949 [1]
222 iso-2022-kr [RFC1557]
223 johab [KS X 1001:1998, Annex 3]
224 ksc5601-raw { KSC5601 without CES }
225 ----------------------------------------------------------------
226
227 [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
228 See below.
229
230 Encode::TW -- Taiwan
231 Standard DOS/Win Macintosh Comment/Reference
232 ----------------------------------------------------------------
233 big5-eten cp950 MacChineseTrad {big5 aliased to big5-eten}
234 big5-hkscs
235 ----------------------------------------------------------------
236
237 Encode::HanExtra -- More Chinese via CPAN
238 Due to the size concerns, additional Chinese encodings below are
239 distributed separately on CPAN, under the name Encode::HanExtra.
240
241 Standard DOS/Win Macintosh Comment/Reference
242 ----------------------------------------------------------------
243 big5ext CMEX's Big5e Extension
244 big5plus CMEX's Big5+ Extension
245 cccii Chinese Character Code for Information Interchange
246 euc-tw EUC (Extended Unix Character)
247 gb18030 GBK with Traditional Characters
248 ----------------------------------------------------------------
249
250 Encode::JIS2K -- JIS X 0213 encodings via CPAN
251 Due to size concerns, additional Japanese encodings below are
252 distributed separately on CPAN, under the name Encode::JIS2K.
253
Standard DOS/Win Macintosh Comment/Reference
----------------------------------------------------------------
euc-jisx0213
shiftjisx0213
iso-2022-jp-3
jis0213-1-raw
jis0213-2-raw
----------------------------------------------------------------
262
263 Miscellaneous encodings
264 Encode::EBCDIC
265 See perlebcdic for details.
266
267 ----------------------------------------------------------------
268 cp37
269 cp500
270 cp875
271 cp1026
272 cp1047
273 posix-bc
274 ----------------------------------------------------------------
275
276 Encode::Symbols
277 For symbols and dingbats.
278
279 ----------------------------------------------------------------
280 symbol
281 dingbats
282 MacDingbats
283 AdobeZdingbat
284 AdobeSymbol
285 ----------------------------------------------------------------
286
287 Encode::MIME::Header
Strictly speaking, the MIME header encoding documented in RFC 2047 is
more of an encapsulation than an encoding. However, support for it is
imperative in the modern world, so it is supported.
291
292 ----------------------------------------------------------------
293 MIME-Header [RFC2047]
294 MIME-B [RFC2047]
295 MIME-Q [RFC2047]
296 ----------------------------------------------------------------
297
298 Encode::Guess
This is not the name of an encoding but a utility that lets you pick
the most appropriate encoding for a piece of data from a list of given
suspects. See Encode::Guess for details.
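
A minimal sketch of guessing, assuming the guess_encoding() function
that Encode::Guess exports and its default suspects. The sample octet
strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Two hiragana letters as UTF-8 octets.
my $octets = encode( 'UTF-8', "\x{3042}\x{3044}" );

# On success an encoding object is returned; on failure, an error string.
my $guess = guess_encoding($octets);
my $name  = ref $guess ? $guess->name : undef;

# A stray 0xE9 byte is neither ASCII nor well-formed UTF-8.
my $failed = guess_encoding("caf\xE9 au lait");

print "guessed: ", ( defined $name ? $name : "nothing" ), "\n";
print "failure: $failed\n" unless ref $failed;
```

Additional suspects, such as euc-jp or shiftjis, can be passed after the
data when the defaults are not enough.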
302
304 The following encodings are not supported as yet; some because they are
305 rarely used, some because of technical difficulties. They may be
306 supported by external modules via CPAN in the future, however.
307
308 ISO-2022-JP-2 [RFC1554]
Not very popular yet. Needs the Unicode Database or equivalent to
implement encode(), because it includes JIS X 0208/0212, KSC5601, and
GB2312 simultaneously, whose code points in Unicode overlap. So you
need to look up the database to determine to which character set a
given Unicode character should belong.
314
315 ISO-2022-CN [RFC1922]
Not very popular. Needs CNS 11643-1 and -2 which are not available
in this module. CNS 11643 is supported (via euc-tw) in
Encode::HanExtra. Autrijus Tang may add support for this encoding in
his module in the future.
320
321 Various HP-UX encodings
322 The following are unsupported due to the lack of mapping data.
323
324 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
325 '15' - japanese15, korean15, and roi15
326
327 Cyrillic encoding ISO-IR-111
328 Anton Tagunov doubts its usefulness.
329
330 ISO-8859-8-1 [Hebrew]
331 None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and
332 MacHebrew are supported because and just because there were mappings
333 available at <http://www.unicode.org/>). Contributions welcome.
334
335 ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
336 Ditto.
337
338 Thai encoding TCVN
339 Ditto.
340
341 Vietnamese encodings VPS
342 Though Jungshik Shin has reported that Mozilla supports this
343 encoding, it was too late before 5.8.0 for us to add it. In the
344 future, it may be available via a separate module. See
345 <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
346 and
347 <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
348 if you are interested in helping us.
349
350 Various Mac encodings
351 The following are unsupported due to the lack of mapping data.
352
353 MacArmenian, MacBengali, MacBurmese, MacEthiopic
354 MacExtArabic, MacGeorgian, MacKannada, MacKhmer
355 MacLaotian, MacMalayalam, MacMongolian, MacOriya
356 MacSinhalese, MacTamil, MacTelugu, MacTibetan
357 MacVietnamese
358
359 The rest which are already available are based upon the vendor
360 mappings at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
361
362 (Mac) Indic encodings
The maps for the following are available at <http://www.unicode.org/>
but remain unsupported because those encodings need an algorithmic
approach, which enc2xs currently does not support:
366
367 MacDevanagari
368 MacGurmukhi
369 MacGujarati
370
371 For details, please see "Unicode mapping issues and notes:" at
372 <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
373
374 I believe this issue is prevalent not only for Mac Indics but also in
375 other Indic encodings, but the above were the only Indic encodings
376 maps that I could find at <http://www.unicode.org/> .
377
379 We are used to using the term (character) encoding and character set
380 interchangeably. But just as confusing the terms byte and character is
381 dangerous and the terms should be differentiated when needed, we need
382 to differentiate encoding and character set.
383
384 To understand that, here is a description of how we make computers grok
385 our characters.
386
387 · First we start with which characters to include. We call this
388 collection of characters character repertoire.
389
390 · Then we have to give each character a unique ID so your computer can
391 tell the difference between 'a' and 'A'. This itemized character
392 repertoire is now a character set.
393
· If your computer can grok the character set without further
processing, you can go ahead and use it. This is called a coded
character set (CCS) or raw character encoding. ASCII is used this
way for most cases.
398
399 · But in many cases, especially multi-byte CJK encodings, you have to
400 tweak a little more. Your network connection may not accept any data
401 with the Most Significant Bit set, and your computer may not be able
402 to tell if a given byte is a whole character or just half of it. So
403 you have to encode the character set to use it.
404
405 A character encoding scheme (CES) determines how to encode a given
406 character set, or a set of multiple character sets. 7bit ISO-2022 is
407 an example of a CES. You switch between character sets via escape
408 sequences.
409
410 Technically, or mathematically, speaking, a character set encoded in
411 such a CES that maps character by character may form a CCS. EUC is
412 such an example. The CES of EUC is as follows:
413
414 · Map ASCII unchanged.
415
· Map a character set that consists of 94**N or 96**N members (94 or
96 to the power of N) by adding 0x80 to each byte.
418
419 · You can also use 0x8e and 0x8f to indicate that the following
420 sequence of characters belongs to yet another character set. To each
421 following byte is added the value 0x80.
422
By carefully looking at the encoded byte sequence, you can find that
each byte sequence corresponds to a unique number. In that sense, EUC
is a CCS generated by the CES above from up to four CCSs (complicated?).
UTF-8 falls into this category. See "UTF-8" in perlunicode to find out
how UTF-8 maps Unicode to a byte sequence.
428
You may also have found out by now why 7bit ISO-2022 cannot comprise a
CCS. If you look at a byte sequence \x21\x21, you can't tell if it is
two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you
have no trouble differentiating between "!!" and an IDEOGRAPHIC SPACE.
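
The difference can be sketched with the Japanese encodings shipped in
Encode::JP. The example takes the raw JIS X 0208 code for IDEOGRAPHIC
SPACE, applies the EUC rule by hand, and compares the result with what
the euc-jp and iso-2022-jp encodings produce. The variable names are
illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::JP;

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# The raw coded character set gives the two-byte JIS X 0208 code.
my $raw = encode( 'jis0208-raw', $space );

# EUC's rule: add 0x80 to each byte of that code.
my $by_hand = join '', map { chr( ord($_) + 0x80 ) } split //, $raw;
my $euc     = encode( 'euc-jp', $space );

# 7-bit ISO-2022 keeps the bytes as they are and wraps them in escapes.
my $iso2022 = encode( 'iso-2022-jp', $space );

printf "raw: %vX  euc: %vX  iso-2022-jp: %vX\n", $raw, $euc, $iso2022;
```

In the ISO-2022 output the bytes \x21\x21 still appear, and only the
surrounding escape sequences say that they are not two exclamation marks.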
433
435 This section tries to classify the supported encodings by their
436 applicability for information exchange over the Internet and to choose
437 the most suitable aliases to name them in the context of such
438 communication.
439
440 · To (en|de)code encodings marked by "(**)", you need
441 "Encode::HanExtra", available from CPAN.
442
443 Encoding names
444
445 US-ASCII UTF-8 ISO-8859-* KOI8-R
446 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
447 EUC-KR Big5 GB2312
448
449 are registered with IANA as preferred MIME names and may be used over
450 the Internet.
451
"Shift_JIS" has been made official by JIS X 0208:1997.
"Microsoft-related naming mess" gives details.
454
455 "GB2312" is the IANA name for "EUC-CN". See "Microsoft-related naming
456 mess" for details.
457
458 "GB_2312-80" raw encoding is available as "gb2312-raw" with Encode. See
459 Encode::CN for details.
460
461 EUC-CN
462 KOI8-U [RFC2319]
463
464 have not been registered with IANA (as of March 2002) but seem to be
465 supported by major web browsers. The IANA name for "EUC-CN" is
466 "GB2312".
467
468 KS_C_5601-1987
469
470 is heavily misused. See "Microsoft-related naming mess" for details.
471
"KS_C_5601-1987" raw encoding is available as "ksc5601-raw" with
Encode. See Encode::KR for details.
474
475 UTF-16 UTF-16BE UTF-16LE
476
477 are IANA-registered "charset"s. See [RFC 2781] for details. Jungshik
478 Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and
479 NS 4/6. Beware however that
480
· "UTF-16" support in any software you're going to be
using/interoperating with has probably been less tested than "UTF-8"
support
484
485 · "UTF-8" coded data seamlessly passes traditional command piping
486 ("cat", "more", etc.) while "UTF-16" coded data is likely to cause
487 confusion (with its zero bytes, for example)
488
· it is beyond the power of words to describe the way HTML browsers
encode non-"ASCII" form data. To get a general impression, visit
<http://www.alanflavell.org.uk/charset/form-i18n.html>. While
encoding of form data has stabilized for "UTF-8" encoded pages (at
least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with "UTF-16" encoded
pages!
497
498 The rule of thumb is to use "UTF-8" unless you know what you're doing
499 and unless you really benefit from using "UTF-16".
500
501 ISO-IR-165 [RFC1345]
502 VISCII
503 GB 12345
GB 18030 (**) (see links below)
505 EUC-TW (**)
506
507 are totally valid encodings but not registered at IANA. The names
508 under which they are listed here are probably the most widely-known
509 names for these encodings and are recommended names.
510
511 BIG5PLUS (**)
512
513 is a proprietary name.
514
515 Microsoft-related naming mess
516 Microsoft products misuse the following names:
517
518 KS_C_5601-1987
519 Microsoft extension to "EUC-KR".
520
521 Proper names: "CP949", "UHC", "x-windows-949" (as used by Mozilla).
522
See
<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
for details.
527
Encode aliases "KS_C_5601-1987" to "cp949" to reflect this common
misuse. Raw "KS_C_5601-1987" encoding is available as
"ksc5601-raw".
531
532 See Encode::KR for details.
533
534 GB2312
535 Microsoft extension to "EUC-CN".
536
537 Proper names: "CP936", "GBK".
538
539 "GB2312" has been registered in the "EUC-CN" meaning at IANA. This
540 has partially repaired the situation: Microsoft's "GB2312" has become
541 a superset of the official "GB2312".
542
543 Encode aliases "GB2312" to "euc-cn" in full agreement with IANA
544 registration. "cp936" is supported separately. Raw "GB_2312-80"
545 encoding is available as "gb2312-raw".
546
547 See Encode::CN for details.
548
549 Big5
550 Microsoft extension to "Big5".
551
552 Proper name: "CP950".
553
554 Encode separately supports "Big5" and "cp950".
555
556 Shift_JIS
557 Microsoft's understanding of "Shift_JIS".
558
559 JIS has not endorsed the full Microsoft standard however. The
560 official "Shift_JIS" includes only JIS X 0201 and JIS X 0208
561 character sets, while Microsoft has always used "Shift_JIS" to encode
562 a wider character repertoire. See "IANA" registration for
563 "Windows-31J".
564
565 As a historical predecessor, Microsoft's variant probably has more
566 rights for the name, though it may be objected that Microsoft
567 shouldn't have used JIS as part of the name in the first place.
568
569 Unambiguous name: "CP932". "IANA" name (also used by Mozilla, and
570 provided as an alias by Encode): "Windows-31J".
571
572 Encode separately supports "Shift_JIS" and "cp932".
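
How Encode treats these names can be checked with a short sketch using
resolve_alias(). The list of names is taken from this section; the hash
name is illustrative.

```perl
use strict;
use warnings;
use Encode qw(resolve_alias);

# Names as they appear in the wild, mapped to Encode's canonical names.
my %canonical = map { ( $_ => resolve_alias($_) ) }
    qw(KS_C_5601-1987 GB2312 GBK Big5 Windows-31J Shift_JIS);

printf "%-15s => %s\n", $_, $canonical{$_} for sort keys %canonical;
```

When the data is known to come from Microsoft products, naming "cp949",
"cp936", "cp950" or "cp932" explicitly avoids relying on these aliases.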
573
575 character repertoire
576 A collection of unique characters. A character set in the strictest
577 sense. At this stage, characters are not numbered.
578
579 coded character set (CCS)
580 A character set that is mapped in a way computers can use directly.
581 Many character encodings, including EUC, fall in this category.
582
583 character encoding scheme (CES)
An algorithm to map a character set to a byte sequence. You don't
have to be able to tell which character set a given byte sequence
belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
example of being both a CCS and a CES.
588
589 charset (in MIME context)
590 has long been used in the meaning of "encoding", CES.
591
592 While the word combination "character set" has lost this meaning in
593 MIME context since [RFC 2130], the "charset" abbreviation has
594 retained it. This is how [RFC 2277] and [RFC 2278] bless "charset":
595
596 This document uses the term "charset" to mean a set of rules for
597 mapping from a sequence of octets to a sequence of characters, such
598 as the combination of a coded character set and a character encoding
599 scheme; this is also what is used as an identifier in MIME "charset="
600 parameters, and registered in the IANA charset registry ... (Note
601 that this is NOT a term used by other standards bodies, such as ISO).
602 [RFC 2277]
603
604 EUC
605 Extended Unix Character. See ISO-2022.
606
607 ISO-2022
A CES that was carefully designed to coexist with ASCII. There is a
7 bit version and an 8 bit version.
610
611 The 7 bit version switches character set via escape sequence so it
612 cannot form a CCS. Since this is more difficult to handle in
613 programs than the 8 bit version, the 7 bit version is not very
614 popular except for iso-2022-jp, the de facto standard CES for
615 e-mails.
616
617 The 8 bit version can form a CCS. EUC and ISO-8859 are two examples
618 thereof. Pre-5.6 perl could use them as string literals.
619
620 UCS
621 Short for Universal Character Set. When you say just UCS, it means
622 Unicode.
623
624 UCS-2
625 ISO/IEC 10646 encoding form: Universal Character Set coded in two
626 octets.
627
628 Unicode
629 A character set that aims to include all character repertoires of the
630 world. Many character sets in various national as well as industrial
631 standards have become, in a way, just subsets of Unicode.
632
633 UTF
634 Short for Unicode Transformation Format. Determines how to map a
635 Unicode character into a byte sequence.
636
637 UTF-16
638 A UTF in 16-bit encoding. Can either be in big endian or little
639 endian. The big endian version is called UTF-16BE (equal to UCS-2 +
640 surrogate support) and the little endian version is called UTF-16LE.
641
Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW,
Encode::EBCDIC, Encode::Symbol, Encode::MIME::Header, Encode::Guess
645
647 ECMA
648 European Computer Manufacturers Association <http://www.ecma.ch>
649
ECMA-035 (eq "ISO-2022")
<http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
653
654 The specification of ISO-2022 is available from the link above.
655
656 IANA
657 Internet Assigned Numbers Authority <http://www.iana.org/>
658
Assigned Charset Names by IANA
<http://www.iana.org/assignments/character-sets>
662
663 Most of the "canonical names" in Encode derive from this list so
664 you can directly apply the string you have extracted from MIME
665 header of mails and web pages.
666
667 ISO
668 International Organization for Standardization <http://www.iso.ch/>
669
RFC
Request For Comments -- need I say more? <http://www.rfc-editor.org/>,
<http://www.ietf.org/rfc.html>, <http://www.faqs.org/rfcs/>
674
675 UC
676 Unicode Consortium <http://www.unicode.org/>
677
678 Unicode Glossary
679 <http://www.unicode.org/glossary/>
680
681 The glossary of this document is based upon this site.
682
683 Other Notable Sites
684 czyborra.com
685 <http://czyborra.com/>
686
687 Contains a lot of useful information, especially gory details of ISO
688 vs. vendor mappings.
689
690 CJK.inf
691 <http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>
692
693 Somewhat obsolete (last update in 1996), but still useful. Also try
694
695 <ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
696
697 You will find brief info on "EUC-CN", "GBK" and mostly on "GB 18030".
698
699 Jungshik Shin's Hangul FAQ
700 <http://jshin.net/faq>
701
702 And especially its subject 8.
703
704 <http://jshin.net/faq/qa8.html>
705
706 A comprehensive overview of the Korean ("KS *") standards.
707
708 debian.org: "Introduction to i18n"
A brief description for most of the mentioned CJK encodings is
contained in
<http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
713
714 Offline sources
715 "CJKV Information Processing" by Ken Lunde
716 CJKV Information Processing 1999 O'Reilly & Associates, ISBN :
717 1-56592-224-7
718
719 The modern successor of "CJK.inf".
720
721 Features a comprehensive coverage of CJKV character sets and
722 encodings along with many other issues faced by anyone trying to
723 better support CJKV languages/scripts in all the areas of information
724 processing.
725
726 To purchase this book, visit
727 <http://oreilly.com/catalog/9780596514471/> or your favourite
728 bookstore.
729
730
731
732perl v5.12.4 2011-06-01 Encode::Supported(3pm)