1Encode::Supported(3pm) Perl Programmers Reference Guide Encode::Supported(3pm)
2
3
4
6 Encode::Supported -- Encodings supported by Encode
7
9 Encoding Names
10
11 Encoding names are case insensitive. White space in names is ignored.
12 In addition, an encoding may have aliases. Each encoding has one
13 "canonical" name. The "canonical" name is chosen from the names of the
14 encoding by picking the first in the following sequence (with a few
15 exceptions).
16
17 · The name used by the Perl community. That includes 'utf8' and
18 'ascii'. Unlike aliases, canonical names directly reach the method
19 so frequently used names such as 'utf8' don't need to do alias
20 lookups.
21
22 · The MIME name as defined in IETF RFCs. This includes all "iso-"s.
23
24 · The name in the IANA registry.
25
26 · The name used by the organization that defined it.
27
28 Where a de jure canonical name differs from the one used by the Encode
29 module, it is always aliased if the encoding is ever implemented. So you
30 can safely tell whether a given encoding is implemented just by passing
31 its canonical name.
32
33 Because of all the alias issues, and because in the general case
34 encodings have state, "Encode" uses an encoding object internally once an
35 operation is in progress.
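
As a rough illustration of the lookup described above, the sketch below asks
Encode for the encoding object behind a few names and prints the canonical
name each one resolves to. It relies only on find_encoding() and
resolve_alias() from Encode; canonical_name() is a helper made up for this
example, and the sample names are arbitrary. The canonical name reported for
some aliases may differ between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Hypothetical helper: return the canonical name for an encoding name
# or alias, or undef when Encode does not know the encoding.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return $enc ? $enc->name : undef;
}

for my $name ('ascii', 'US-ascii', 'Latin1', 'ISO-8859-1', 'no-such-encoding') {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name, defined $canon ? $canon : '(not implemented)';
}

# resolve_alias() gives the same answer without handing back the object.
print "latin1 resolves to ", scalar resolve_alias('latin1'), "\n";
```

Checking the return value of find_encoding() for undef is the simplest way to
test whether an encoding is implemented before using it.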
36
38 As of Perl 5.8.0, at least the following encodings are recognized.
39 Note that unless otherwise specified, they are all case insensitive
40 (via alias) and all occurrences of spaces are replaced with '-'. In
41 other words, "ISO 8859 1" and "iso-8859-1" are identical.
42
43 Encodings are categorized and implemented in several different modules
44 but you don't have to "use Encode::XX" to make them available for most
45 cases. Encode.pm will automatically load those modules on demand.
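
A minimal sketch of the on-demand loading, assuming a stock Encode
installation: Encode->encodings() lists what is currently loaded, while
Encode->encodings(":all") also lists encodings whose modules have not been
loaded yet. The choice of euc-jp as the sample encoding is arbitrary.

```perl
use strict;
use warnings;
use Encode;

# Encodings whose modules are already loaded.
my @loaded = Encode->encodings();

# Every encoding Encode knows about, including those loaded on demand.
my @all = Encode->encodings(':all');

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;

# No "use Encode::JP" is needed: the module is pulled in on first use.
my $bytes = Encode::encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A
printf "euc-jp bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
```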
46
47 Built-in Encodings
48
49 The following encodings are always available.
50
51 Canonical   Aliases                Comments & References
52 ----------------------------------------------------------------
53 ascii       US-ascii ISO-646-US    [ECMA]
54 ascii-ctrl                         Special Encoding
55 iso-8859-1  latin1                 [ISO]
56 null                               Special Encoding
57 utf8        UTF-8                  [RFC2279]
58 ----------------------------------------------------------------
59
60 null and ascii-ctrl are special. "null" fails for all characters, so
61 when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
62 CHARACTERS will fall back to character references. Ditto for "ascii-ctrl"
63 except for control characters. For fallback modes, see Encode.
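
The fallback modes can be sketched as follows. This is only an illustration:
encode_with_fallback() is a helper invented for the example, and the sample
strings are arbitrary. It passes a copy of the text because encode() may
modify its argument when a CHECK value is given.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode $text with the given fallback mode.
# A copy is passed because encode() may modify its argument when a
# CHECK value is supplied.
sub encode_with_fallback {
    my ($encoding, $text, $check) = @_;
    my $copy = $text;
    return Encode::encode($encoding, $copy, $check);
}

my $text = "caf\x{e9}";    # U+00E9 is not in ASCII

print encode_with_fallback('ascii', $text, Encode::FB_PERLQQ()),   "\n";
print encode_with_fallback('ascii', $text, Encode::FB_HTMLCREF()), "\n";
print encode_with_fallback('ascii', $text, Encode::FB_XMLCREF()),  "\n";

# With "null", every character is unmappable, so all of them fall back.
print encode_with_fallback('null', 'Perl', Encode::FB_XMLCREF()), "\n";
```

With "ascii" only the unmappable character is replaced; with "null" the same
call turns the whole string into character references.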
64
65 Encode::Unicode -- other Unicode encodings
66
67 Unicode coding schemes other than native utf8 are supported by
68 Encode::Unicode, which will be autoloaded on demand.
69
70 ----------------------------------------------------------------
71 UCS-2BE     UCS-2, iso-10646-1     [IANA, UC]
72 UCS-2LE                            [UC]
73 UTF-16                             [UC]
74 UTF-16BE                           [UC]
75 UTF-16LE                           [UC]
76 UTF-32                             [UC]
77 UTF-32BE    UCS-4                  [UC]
78 UTF-32LE                           [UC]
79 UTF-7                              [RFC2152]
80 ----------------------------------------------------------------
81
82 To find out how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, see
83 Encode::Unicode.
84
85 UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
86 encoding. It is implemented separately by Encode::Unicode::UTF7.
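
To give a feel for how the BE and LE variants differ, the sketch below
encodes a single character with several of the names from the table and
prints the resulting bytes. hex_bytes() is a helper made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $char = "A";    # U+0041
for my $name (qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE)) {
    printf "%-9s %s\n", $name, hex_bytes(encode($name, $char));
}
```

The name without a BE or LE suffix prepends a byte order mark when encoding,
which is why its output is two bytes longer.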
87
88 Encode::Byte -- Extended ASCII
89
90 Encode::Byte implements most single-byte encodings except for Symbols
91 and EBCDIC. The following encodings are based on single-byte encodings
92 implemented as extended ASCII. Most of them map \x80-\xff (upper half)
93 to non-ASCII characters.
94
95 ISO-8859 and corresponding vendor mappings
96 Since there are so many, they are presented in table format with
97 languages and corresponding encoding names by vendors. Note that
98 the table is sorted in order of ISO-8859 and the corresponding
99 vendor mappings are slightly different from those of ISO. See
100 <http://czyborra.com/charsets/iso8859.html> for details.
101
102 Lang/Regions   ISO/Other Std.  DOS     Windows Macintosh           Others
103 ----------------------------------------------------------------------------------------
104 N. America     (ASCII)         cp437                               AdobeStandardEncoding
105                                cp863 (DOSCanadaF)
106 W. Europe      iso-8859-1      cp850   cp1252  MacRoman            nextstep
107                                                                    hp-roman8
108                                cp860 (DOSPortuguese)
109 Cntrl. Europe  iso-8859-2      cp852   cp1250  MacCentralEurRoman
110                                                MacCroatian
111                                                MacRomanian
112                                                MacRumanian
113 Latin3[1]      iso-8859-3
114 Latin4[2]      iso-8859-4
115 Cyrillics      iso-8859-5      cp855   cp1251  MacCyrillic
116   (See also next section)      cp866           MacUkrainian
117 Arabic         iso-8859-6      cp864   cp1256  MacArabic
118                                cp1006          MacFarsi
119 Greek          iso-8859-7      cp737   cp1253  MacGreek
120                                cp869 (DOSGreek2)
121 Hebrew         iso-8859-8      cp862   cp1255  MacHebrew
122 Turkish        iso-8859-9      cp857   cp1254  MacTurkish
123 Nordics        iso-8859-10     cp865
124                                cp861           MacIcelandic
125                                                MacSami
126 Thai           iso-8859-11[3]  cp874           MacThai
127 (iso-8859-12 is nonexistent. Reserved for Indics?)
128 Baltics        iso-8859-13     cp775   cp1257
129 Celtics        iso-8859-14
130 Latin9 [4]     iso-8859-15
131 Latin10        iso-8859-16
132 Vietnamese     viscii                  cp1258  MacVietnamese
133 ----------------------------------------------------------------------------------------
134
135 [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
136 [2] Baltics. Now on 8859-10, except for Latvian.
137 [3] TIS 620 + Non-Breaking Space (0xA0 / U+00A0)
138 [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
139 letters that are missing from 8859-1 were added.
140
141 All cp* are also available as ibm-*, ms-*, and windows-*. See
142 also <http://czyborra.com/charsets/codepages.html>.
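
A small sketch of both points made above, under the assumption of a stock
Encode installation: the vendor-prefixed names resolve to the cp* canonical
names, and a vendor mapping can differ from its ISO counterpart for the same
byte. The sample aliases and the byte 0x80 are arbitrary choices.

```perl
use strict;
use warnings;
use Encode qw(decode resolve_alias);

# The vendor prefixes are aliases for the cp* canonical names.
for my $alias (qw(windows-1252 ibm-850 cp1252)) {
    printf "%-13s => %s\n", $alias, scalar resolve_alias($alias);
}

# Byte 0x80 is a C1 control in iso-8859-1 but the Euro sign in cp1252.
my $byte = "\x80";
printf "iso-8859-1: U+%04X\n", ord(decode('iso-8859-1', $byte));
printf "cp1252:     U+%04X\n", ord(decode('cp1252',     $byte));
```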
143
144 Macintosh encodings don't seem to be registered in such entities as
145 IANA. "Canonical" names in Encode are based upon Apple's Tech Note
146 1150. See <http://developer.apple.com/technotes/tn/tn1150.html>
147 for details.
148
149 KOI8 - De Facto Standard for the Cyrillic world
150 Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
151 popular on the Net. Encode comes with the following KOI charsets.
152 For gory details, see <http://czyborra.com/charsets/cyrillic.html>.
153
154 ----------------------------------------------------------------
155 koi8-f
156 koi8-r cp878 [RFC1489]
157 koi8-u [RFC2319]
158 ----------------------------------------------------------------
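
The following sketch shows why the distinction between KOI8 and ISO-8859-5
matters in practice: the same Cyrillic letter is stored as a different byte
in each, so data has to be decoded to characters before it is re-encoded.
The sample word is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $letter = "\x{41F}";    # CYRILLIC CAPITAL LETTER PE

# The same character lands on different bytes in the two encodings.
printf "koi8-r:     %02X\n", ord(encode('koi8-r',     $letter));
printf "iso-8859-5: %02X\n", ord(encode('iso-8859-5', $letter));

# Converting between them means decoding to characters first.
my $word       = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";
my $koi8_bytes = encode('koi8-r', $word);
my $iso_bytes  = encode('iso-8859-5', decode('koi8-r', $koi8_bytes));
```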
159
160 gsm0338 - Hentai Latin 1
161 GSM0338 is for GSM handsets. Though it shares alphanumerals with
162 ASCII, control character ranges and other parts are mapped very
163 differently, mainly to store Greek characters. There are also
164 escape sequences (starting with 0x1B) to cover e.g. the Euro sign.
165 Some special cases like a trailing 0x00 byte or a lone 0x1B byte
166 are not well-defined and decode() will return an empty string for
167 them. One possible workaround is
168
169 $gsm =~ s/\x00\z/\x00\x00/;
170 $uni = decode("gsm0338", $gsm);
171 $uni .= "\xA0" if $gsm =~ /\x1B\z/;
172
173 Note that the Encode implementation of GSM0338 does not implement
174 the reuse of Latin capital letters as Greek capital letters (for
175 example, 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396
176 (GREEK CAPITAL LETTER ZETA)).
177
178 The GSM0338 is also covered in Encode::Byte even though it is not
179 an "extended ASCII" encoding.
180
181 CJK: Chinese, Japanese, Korean (Multibyte)
182
183 Note that Vietnamese is listed above. Also read "Encoding vs Charset"
184 below. Also note that these are implemented in distinct modules by
185 country, due to size concerns (simplified Chinese is mapped to
186 'CN', continental China, while traditional Chinese is mapped to 'TW',
187 Taiwan). Please refer to their respective documentation pages.
188
189 Encode::CN -- Continental China
190 Standard      DOS/Win   Macintosh       Comment/Reference
191 ----------------------------------------------------------------
192 euc-cn [1]              MacChineseSimp
193 (gbk)         cp936 [2]
194 gb12345-raw                             { GB12345 without CES }
195 gb2312-raw                              { GB2312 without CES }
196 hz
197 iso-ir-165
198 ----------------------------------------------------------------
199
200 [1] GB2312 is aliased to this. See "Microsoft-related naming mess".
201 [2] gbk is aliased to this. See "Microsoft-related naming mess".
202
203 Encode::JP -- Japan
204 Standard      DOS/Win   Macintosh       Comment/Reference
205 ----------------------------------------------------------------
206 euc-jp
207 shiftjis      cp932     macJapanese
208 7bit-jis
209 iso-2022-jp                             [RFC1468]
210 iso-2022-jp-1                           [RFC2237]
211 jis0201-raw                             { JIS X 0201 (roman + halfwidth kana) without CES }
212 jis0208-raw                             { JIS X 0208 (Kanji + fullwidth kana) without CES }
213 jis0212-raw                             { JIS X 0212 (Extended Kanji) without CES }
214 ----------------------------------------------------------------
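
As an illustration of how the encodings in the table above differ, this
sketch encodes one hiragana letter with three of them and prints the bytes.
hex_bytes() is a helper made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $hiragana_a = "\x{3042}";    # HIRAGANA LETTER A
for my $name (qw(euc-jp shiftjis iso-2022-jp)) {
    printf "%-12s %s\n", $name, hex_bytes(encode($name, $hiragana_a));
}
```

The iso-2022-jp result is longer because the two character bytes are wrapped
in escape sequences that switch into and out of JIS X 0208.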
215
216 Encode::KR -- Korea
217 Standard      DOS/Win   Macintosh       Comment/Reference
218 ----------------------------------------------------------------
219 euc-kr                  MacKorean       [RFC1557]
220 cp949 [1]
221 iso-2022-kr                             [RFC1557]
222 johab                                   [KS X 1001:1998, Annex 3]
223 ksc5601-raw                             { KSC5601 without CES }
224 ----------------------------------------------------------------
225
226 [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
227 See below.
228
229 Encode::TW -- Taiwan
230 Standard      DOS/Win   Macintosh       Comment/Reference
231 ----------------------------------------------------------------
232 big5-eten     cp950     MacChineseTrad  {big5 aliased to big5-eten}
233 big5-hkscs
234 ----------------------------------------------------------------
235
236 Encode::HanExtra -- More Chinese via CPAN
237 Due to size concerns, the additional Chinese encodings below are
238 distributed separately on CPAN, under the name Encode::HanExtra.
239
240 Standard DOS/Win Macintosh Comment/Reference
241 ----------------------------------------------------------------
242 big5ext CMEX's Big5e Extension
243 big5plus CMEX's Big5+ Extension
244 cccii Chinese Character Code for Information Interchange
245 euc-tw EUC (Extended Unix Character)
246 gb18030 GBK with Traditional Characters
247 ----------------------------------------------------------------
248
249 Encode::JIS2K -- JIS X 0213 encodings via CPAN
250 Due to size concerns, the additional Japanese encodings below are
251 distributed separately on CPAN, under the name Encode::JIS2K.
252
253 Standard DOS/Win Macintosh Comment/Reference
254 ----------------------------------------------------------------
255 euc-jisx0213
256 shiftjisx0213
257 iso-2022-jp-3
258 jis0213-1-raw
259 jis0213-2-raw
260 ----------------------------------------------------------------
261
262 Miscellaneous encodings
263
264 Encode::EBCDIC
265 See perlebcdic for details.
266
267 ----------------------------------------------------------------
268 cp37
269 cp500
270 cp875
271 cp1026
272 cp1047
273 posix-bc
274 ----------------------------------------------------------------
275
276 Encode::Symbol
277 For symbols and dingbats.
278
279 ----------------------------------------------------------------
280 symbol
281 dingbats
282 MacDingbats
283 AdobeZdingbat
284 AdobeSymbol
285 ----------------------------------------------------------------
286
287 Encode::MIME::Header
288 Strictly speaking, MIME header encoding documented in RFC 2047 is
289 more an encapsulation than an encoding. However, supporting it in the
290 modern world is imperative, so it is supported.
291
292 ----------------------------------------------------------------
293 MIME-Header [RFC2047]
294 MIME-B [RFC2047]
295 MIME-Q [RFC2047]
296 ----------------------------------------------------------------
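
A minimal sketch of the encapsulation, assuming the "MIME-Header" name from
the table above: the text is wrapped as an RFC 2047 encoded-word and then
unwrapped again. The sample subject and the hand-written encoded-word are
made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "\x{3042}";    # HIRAGANA LETTER A

# Wrap the text as an RFC 2047 encoded-word, then unwrap it again.
my $header  = encode('MIME-Header', $subject);
my $decoded = decode('MIME-Header', $header);
print "header: $header\n";

# A hand-written encoded-word (UTF-8 charset, base64) decodes the same way.
my $from_wire = decode('MIME-Header', '=?UTF-8?B?44GC?=');
```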
297
298 Encode::Guess
299 This is not the name of an encoding but a utility that lets you pick
300 the most appropriate encoding for your data out of given suspects.
301 See Encode::Guess for details.
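
A rough sketch of how guessing might be used. guess_encoding() is the
function exported by Encode::Guess; the sample data and the single suspect
are arbitrary choices for this example, and real data is often ambiguous
between several suspects.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

my $octets = encode('UTF-8', "\x{3042}\x{3044}");    # two hiragana letters

# The listed suspects are tried in addition to the module's defaults.
my $guess = guess_encoding($octets, 'euc-jp');

if (ref $guess) {
    # Success: an encoding object that can decode the data.
    my $text = $guess->decode($octets);
    printf "looks like %s, %d characters\n", $guess->name, length $text;
}
else {
    # On failure or ambiguity, a message string is returned instead.
    print "could not decide: $guess\n";
}
```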
302
304 The following encodings are not supported as yet; some because they are
305 rarely used, some because of technical difficulties. They may be
306 supported by external modules via CPAN in the future, however.
307
308 ISO-2022-JP-2 [RFC1554]
309 Not very popular yet. Needs Unicode Database or equivalent to
310 implement encode() (because it includes JIS X 0208/0212, KSC5601,
311 and GB2312 simultaneously, whose code points in Unicode overlap,
312 so you need to look up the database to determine to which character
313 set a given Unicode character should belong).
314
315 ISO-2022-CN [RFC1922]
316 Not very popular. Needs CNS 11643-1 and -2 which are not available
317 in this module. CNS 11643 is supported (via euc-tw) in
318 Encode::HanExtra. Autrijus Tang may add support for this encoding
319 in his module in the future.
320
321 Various HP-UX encodings
322 The following are unsupported due to the lack of mapping data.
323
324 '8' - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
325 '15' - japanese15, korean15, and roi15
326
327 Cyrillic encoding ISO-IR-111
328 Anton Tagunov doubts its usefulness.
329
330 ISO-8859-8-I [Hebrew]
331 None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255 and
332 MacHebrew are supported because and just because there were
333 mappings available at <http://www.unicode.org/>). Contributions
334 welcome.
335
336 ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
337 Ditto.
338
339 Thai encoding TCVN
340 Ditto.
341
342 Vietnamese encodings VPS
343 Though Jungshik Shin has reported that Mozilla supports this
344 encoding, it was too late before 5.8.0 for us to add it. In the future,
345 it may be available via a separate module. See
346 <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
347 and
348 <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
349 if you are interested in helping us.
351
352 Various Mac encodings
353 The following are unsupported due to the lack of mapping data.
354
355 MacArmenian, MacBengali, MacBurmese, MacEthiopic
356 MacExtArabic, MacGeorgian, MacKannada, MacKhmer
357 MacLaotian, MacMalayalam, MacMongolian, MacOriya
358 MacSinhalese, MacTamil, MacTelugu, MacTibetan
359 MacVietnamese
360
361 The rest which are already available are based upon the vendor
362 mappings at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
363
364 (Mac) Indic encodings
365 The maps for the following are available at
366 <http://www.unicode.org/> but remain unsupported because those encodings
367 need an algorithmic approach, currently unsupported by enc2xs:
368
369 MacDevanagari
370 MacGurmukhi
371 MacGujarati
372
373 For details, please see "Unicode mapping issues and notes:" at
374 <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>
375 .
376
377 I believe this issue is prevalent not only for Mac Indics but also
378 in other Indic encodings, but the above were the only Indic
379 encoding maps that I could find at <http://www.unicode.org/> .
380
382 We are used to using the term (character) encoding and character set
383 interchangeably. But just as confusing the terms byte and character is
384 dangerous and the terms should be differentiated when needed, we need
385 to differentiate encoding and character set.
386
387 To understand that, here is a description of how we make computers grok
388 our characters.
389
390 · First we start with which characters to include. We call this
391 collection of characters a character repertoire.
392
393 · Then we have to give each character a unique ID so your computer
394 can tell the difference between 'a' and 'A'. This itemized
395 character repertoire is now a character set.
396
397 · If your computer can grok the character set without further
398 processing, you can go ahead and use it. This is called a coded
399 character set (CCS) or raw character encoding. ASCII is used this way
400 for most cases.
401
402 · But in many cases, especially multi-byte CJK encodings, you have to
403 tweak a little more. Your network connection may not accept any
404 data with the Most Significant Bit set, and your computer may not
405 be able to tell if a given byte is a whole character or just half
406 of it. So you have to encode the character set to use it.
407
408 A character encoding scheme (CES) determines how to encode a given
409 character set, or a set of multiple character sets. 7bit ISO-2022
410 is an example of a CES. You switch between character sets via
411 escape sequences.
412
413 Technically, or mathematically, speaking, a character set encoded in
414 such a CES that maps character by character may form a CCS. EUC is
415 such an example. The CES of EUC is as follows:
416
417 · Map ASCII unchanged.
418
419 · Map a character set that consists of 94 or 96 to the power of N
420 members by adding 0x80 to each byte.
421
422 · You can also use 0x8e and 0x8f to indicate that the following
423 sequence of characters belongs to yet another character set. The
424 value 0x80 is added to each of the following bytes.
425
426 By carefully looking at the encoded byte sequence, you can find that
427 each byte sequence forms a unique number. In that sense, EUC is a
428 CCS generated by the CES above from up to four CCSs (complicated?). UTF-8
429 falls into this category. See "UTF-8" in perlunicode to find out how
430 UTF-8 maps Unicode to a byte sequence.
431
432 You may also have found out by now why 7bit ISO-2022 cannot comprise a
433 CCS. If you look at a byte sequence \x21\x21, you can't tell if it is
434 two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1 so you
435 have no trouble differentiating between "!!" and an IDEOGRAPHIC SPACE.
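
The relationship between a raw coded character set, EUC and 7bit ISO-2022
can be sketched with the encodings Encode::JP provides. This is only an
illustration: hex_bytes() is a helper made up for the example, and the
"add 0x80 by hand" step reproduces the EUC rule described above rather than
anything Encode does internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $ideographic_space = "\x{3000}";

# Raw JIS X 0208 code for the character: two 7-bit bytes.
my $raw = encode('jis0208-raw', $ideographic_space);

# The EUC rule: add 0x80 to each byte of the raw code.
my $euc_by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;
my $euc         = encode('euc-jp', $ideographic_space);

# 7bit ISO-2022 keeps the raw bytes and announces the character set
# with escape sequences instead.
my $iso2022 = encode('iso-2022-jp', $ideographic_space);

printf "raw:         %s\n", hex_bytes($raw);
printf "euc-jp:      %s\n", hex_bytes($euc);
printf "iso-2022-jp: %s\n", hex_bytes($iso2022);
printf "ascii '!!':  %s\n", hex_bytes(encode('ascii', '!!'));
```

Without its escape sequences, the iso-2022-jp payload is byte for byte the
same as the ASCII string "!!", which is the ambiguity described above.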
436
438 This section tries to classify the supported encodings by their
439 applicability for information exchange over the Internet and to choose
440 the most suitable aliases to name them in the context of such
441 communication.
442
443 · To (en|de)code encodings marked by "(**)", you need
444 "Encode::HanExtra", available from CPAN.
445
446 Encoding names
447
448 US-ASCII UTF-8 ISO-8859-* KOI8-R
449 Shift_JIS EUC-JP ISO-2022-JP ISO-2022-JP-1
450 EUC-KR Big5 GB2312
451
452 are registered with IANA as preferred MIME names and may be used over
453 the Internet.
454
455 "Shift_JIS" has been made official by JIS X 0208:1997.
456 "Microsoft-related naming mess" gives details.
457
458 "GB2312" is the IANA name for "EUC-CN". See "Microsoft-related naming
459 mess" for details.
460
461 "GB_2312-80" raw encoding is available as "gb2312-raw" with Encode. See
462 Encode::CN for details.
463
464 EUC-CN
465 KOI8-U [RFC2319]
466
467 have not been registered with IANA (as of March 2002) but seem to be
468 supported by major web browsers. The IANA name for "EUC-CN" is
469 "GB2312".
470
471 KS_C_5601-1987
472
473 is heavily misused. See "Microsoft-related naming mess" for details.
474
475 "KS_C_5601-1987" raw encoding is available as "ksc5601-raw" with
476 Encode. See Encode::KR for details.
477
478 UTF-16 UTF-16BE UTF-16LE
479
480 are IANA-registered "charset"s. See [RFC 2781] for details. Jungshik
481 Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and
482 NS 4/6. Beware however that
483
484 · "UTF-16" support in any software you're going to be using or
485 interoperating with has probably been less tested than "UTF-8" support
486
487 · "UTF-8" coded data seamlessly passes traditional command piping
488 ("cat", "more", etc.) while "UTF-16" coded data is likely to cause
489 confusion (with its zero bytes, for example)
490
491 · it is beyond the power of words to describe the way HTML browsers
492 encode non-"ASCII" form data. To get a general impression, visit
493 <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
494 While encoding of form data has stabilized for "UTF-8" encoded
495 pages (at least IE 5/6, NS 6, and Opera 6 behave consistently), be
496 sure to expect fun (and cross-browser discrepancies) with "UTF-16"
497 encoded pages!
498
499 The rule of thumb is to use "UTF-8" unless you know what you're doing
500 and unless you really benefit from using "UTF-16".
501
502 ISO-IR-165 [RFC1345]
503 VISCII
504 GB 12345
505 GB 18030 (**) (see links below)
506 EUC-TW (**)
507
508 are totally valid encodings but not registered at IANA. The names
509 under which they are listed here are probably the most widely-known
510 names for these encodings and are recommended names.
511
512 BIG5PLUS (**)
513
514 is a proprietary name.
515
516 Microsoft-related naming mess
517
518 Microsoft products misuse the following names:
519
520 KS_C_5601-1987
521 Microsoft extension to "EUC-KR".
522
523 Proper names: "CP949", "UHC", "x-windows-949" (as used by Mozilla).
524
525 See <http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
526 for details.
527
528 Encode aliases "KS_C_5601-1987" to "cp949" to reflect this common
529 misuse. Raw "KS_C_5601-1987" encoding is available as
530 "ksc5601-raw".
531
532 See Encode::KR for details.
533
534 GB2312
535 Microsoft extension to "EUC-CN".
536
537 Proper names: "CP936", "GBK".
538
539 "GB2312" has been registered in the "EUC-CN" meaning at IANA. This
540 has partially repaired the situation: Microsoft's "GB2312" has
541 become a superset of the official "GB2312".
542
543 Encode aliases "GB2312" to "euc-cn" in full agreement with IANA
544 registration. "cp936" is supported separately. Raw "GB_2312-80"
545 encoding is available as "gb2312-raw".
546
547 See Encode::CN for details.
548
549 Big5
550 Microsoft extension to "Big5".
551
552 Proper name: "CP950".
553
554 Encode separately supports "Big5" and "cp950".
555
556 Shift_JIS
557 Microsoft's understanding of "Shift_JIS".
558
559 JIS has not endorsed the full Microsoft standard, however. The
560 official "Shift_JIS" includes only the JIS X 0201 and JIS X 0208
561 character sets, while Microsoft has always used "Shift_JIS" to encode a
562 wider character repertoire. See the "IANA" registration for
563 "Windows-31J".
564
565 As a historical predecessor, Microsoft's variant probably has more
566 rights for the name, though it may be objected that Microsoft
567 shouldn't have used JIS as part of the name in the first place.
568
569 Unambiguous name: "CP932". "IANA" name (also used by Mozilla, and
570 provided as an alias by Encode): "Windows-31J".
571
572 Encode separately supports "Shift_JIS" and "cp932".
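
The aliasing decisions described in this section can be checked with
resolve_alias(). The sketch below simply prints what each of the contested
names resolves to; the list of names is taken from the text above, and the
results may differ in other Encode versions.

```perl
use strict;
use warnings;
use Encode qw(resolve_alias);

# Names as they tend to appear in MIME headers.
my @names = ('KS_C_5601-1987', 'GB2312', 'GBK', 'Big5', 'Shift_JIS', 'Windows-31J');

# Map each name to the canonical name Encode uses for it.
my %canonical = map { ($_ => scalar resolve_alias($_)) } @names;

printf "%-15s => %s\n", $_, $canonical{$_} for @names;
```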
573
575 character repertoire
576 A collection of unique characters. A character set in the
577 strictest sense. At this stage, characters are not numbered.
578
579 coded character set (CCS)
580 A character set that is mapped in a way computers can use directly.
581 Many character encodings, including EUC, fall in this category.
582
583 character encoding scheme (CES)
584 An algorithm to map a character set to a byte sequence. You don't
585 have to be able to tell which character set a given byte sequence
586 belongs to. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is
587 an example of being both a CCS and CES.
588
589 charset (in MIME context)
590 has long been used in the meaning of "encoding", CES.
591
592 While the word combination "character set" has lost this meaning in
593 MIME context since [RFC 2130], the "charset" abbreviation has
594 retained it. This is how [RFC 2277] and [RFC 2278] bless "charset":
595
596 This document uses the term "charset" to mean a set of rules for
597 mapping from a sequence of octets to a sequence of characters, such
598 as the combination of a coded character set and a character encoding
599 scheme; this is also what is used as an identifier in MIME "charset="
600 parameters, and registered in the IANA charset registry ... (Note
601 that this is NOT a term used by other standards bodies, such as ISO).
602 [RFC 2277]
603
604 EUC Extended Unix Character. See ISO-2022.
605
606 ISO-2022
607 A CES that was carefully designed to coexist with ASCII. There are
608 a 7 bit version and an 8 bit version.
609
610 The 7 bit version switches character set via escape sequence so it
611 cannot form a CCS. Since this is more difficult to handle in
612 programs than the 8 bit version, the 7 bit version is not very popular
613 except for iso-2022-jp, the de facto standard CES for e-mails.
614
615 The 8 bit version can form a CCS. EUC and ISO-8859 are two
616 examples thereof. Pre-5.6 perl could use them as string literals.
617
618 UCS Short for Universal Character Set. When you say just UCS, it means
619 Unicode.
620
621 UCS-2
622 ISO/IEC 10646 encoding form: Universal Character Set coded in two
623 octets.
624
625 Unicode
626 A character set that aims to include all character repertoires of
627 the world. Many character sets in various national as well as
628 industrial standards have become, in a way, just subsets of
629 Unicode.
630
631 UTF Short for Unicode Transformation Format. Determines how to map a
632 Unicode character into a byte sequence.
633
634 UTF-16
635 A UTF in 16-bit encoding. Can either be in big endian or little
636 endian. The big endian version is called UTF-16BE (equal to UCS-2
637 + surrogate support) and the little endian version is called
638 UTF-16LE.
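
The "surrogate support" mentioned above can be seen by encoding a character
outside the 16-bit range. This is a small sketch; the sample code point is an
arbitrary choice.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{10000}";    # first code point beyond the 16-bit range

# UTF-16 needs a surrogate pair (two 16-bit units) for this character.
my $be = encode('UTF-16BE', $char);
my $le = encode('UTF-16LE', $char);

printf "UTF-16BE: %s\n", join ' ', map { sprintf '%02X', ord } split //, $be;
printf "UTF-16LE: %s\n", join ' ', map { sprintf '%02X', ord } split //, $le;

# Decoding the pair yields the single original character again.
my $round_trip = decode('UTF-16BE', $be);
```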
639
641 Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW,
642 Encode::EBCDIC, Encode::Symbol, Encode::MIME::Header, Encode::Guess
643
645 ECMA
646 European Computer Manufacturers Association <http://www.ecma.ch>
647
648 ECMA-035 (eq "ISO-2022")
649 <http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
650
651 The specification of ISO-2022 is available from the link above.
652
653 IANA
654 Internet Assigned Numbers Authority <http://www.iana.org/>
655
656 Assigned Charset Names by IANA
657 <http://www.iana.org/assignments/character-sets>
658
659 Most of the "canonical names" in Encode derive from this list
660 so you can directly apply the string you have extracted from
661 MIME header of mails and web pages.
662
663 ISO International Organization for Standardization <http://www.iso.ch/>
664
665 RFC Request For Comments -- need I say more?
666 <http://www.rfc-editor.org/>, <http://www.rfc.net/>, <http://www.faqs.org/rfcs/>
667
668 UC Unicode Consortium <http://www.unicode.org/>
669
670 Unicode Glossary
671 <http://www.unicode.org/glossary/>
672
673 The glossary of this document is based upon this site.
674
675 Other Notable Sites
676
677 czyborra.com
678 <http://czyborra.com/>
679
680 Contains a lot of useful information, especially gory details of
681 ISO vs. vendor mappings.
682
683 CJK.inf
684 <http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
685
686 Somewhat obsolete (last update in 1996), but still useful. Also
687 try
688
689 <ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
691
692 You will find brief info on "EUC-CN", "GBK" and mostly on "GB
693 18030".
694
695 Jungshik Shin's Hangul FAQ
696 <http://jshin.net/faq>
697
698 And especially its subject 8.
699
700 <http://jshin.net/faq/qa8.html>
701
702 A comprehensive overview of the Korean ("KS *") standards.
703
704 debian.org: "Introduction to i18n"
705 A brief description for most of the mentioned CJK encodings is
706 contained in
707 <http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
708
709 Offline sources
710
711 "CJKV Information Processing" by Ken Lunde
712 CJKV Information Processing 1999 O'Reilly & Associates, ISBN :
713 1-56592-224-7
714
715 The modern successor of "CJK.inf".
716
717 Features comprehensive coverage of CJKV character sets and
718 encodings along with many other issues faced by anyone trying to better
719 support CJKV languages/scripts in all the areas of information
720 processing.
721
722 To purchase this book, visit <http://www.oreilly.com/catalog/cjkvinfo/>
723 or your favourite bookstore.
724
725
726
727perl v5.8.8 2001-09-21 Encode::Supported(3pm)