Encode::Supported(3pm) Perl Programmers Reference Guide Encode::Supported(3pm)


NAME

6       Encode::Supported -- Encodings supported by Encode
7

DESCRIPTION

9   Encoding Names
10       Encoding names are case insensitive. White space in names is ignored.
11       In addition, an encoding may have aliases.  Each encoding has one
12       "canonical" name.  The "canonical" name is chosen from the names of the
13       encoding by picking the first in the following sequence (with a few
14       exceptions).
15
       · The name used by the Perl community.  That includes 'utf8' and
         'ascii'.  Unlike aliases, canonical names reach the method
         directly, so frequently used names such as 'utf8' need no alias
         lookup.
20
21       · The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
22
23       · The name in the IANA registry.
24
25       · The name used by the organization that defined it.
26
       If a de jure canonical name differs from the one used by the Encode
       module, it is always provided as an alias whenever the encoding is
       implemented.  So you can safely tell whether a given encoding is
       implemented just by passing its canonical name.
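
       A minimal sketch of that check, using Encode's find_encoding().  The
       helper canonical_name() is a hypothetical name introduced here for
       illustration; it is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# canonical_name() is a hypothetical helper, not part of Encode.
# It returns the canonical name for any spelling or alias, or undef
# when Encode does not implement the encoding.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);
    return defined $enc ? $enc->name : undef;
}

for my $name ('latin1', 'ISO-8859-1', 'US-ASCII', 'no-such-encoding') {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name,
        defined $canon ? $canon : '(not implemented)';
}
```

       The first two names should both resolve to "iso-8859-1" and the third
       to "ascii"; the last one is reported as not implemented.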
31
32       Because of all the alias issues, and because in the general case
33       encodings have state, "Encode" uses an encoding object internally once
34       an operation is in progress.
35

Supported Encodings

       As of Perl 5.8.0, at least the following encodings are recognized.
       Note that unless otherwise specified, they are all case insensitive
       (via alias) and all occurrences of spaces are replaced with '-'.  In
       other words, "ISO 8859 1" and "iso-8859-1" are identical.
41
42       Encodings are categorized and implemented in several different modules
43       but you don't have to "use Encode::XX" to make them available for most
44       cases.  Encode.pm will automatically load those modules on demand.
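
       As a rough illustration of on-demand loading, the sketch below lists
       what is loaded and what is available, then uses a Japanese encoding
       without an explicit "use Encode::JP".  The variable names are
       illustrative only.

```perl
use strict;
use warnings;
use Encode;

# Encodings whose modules are already loaded (the built-ins, at minimum).
my @loaded = Encode->encodings();

# Every encoding that Encode knows how to load on demand.
my @all = Encode->encodings(":all");

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;

# Using an encoding by name loads its module automatically;
# no explicit "use Encode::JP" is needed here.
my $bytes = Encode::encode("euc-jp", "\x{3042}");   # HIRAGANA LETTER A
printf "euc-jp bytes: %s\n", unpack("H*", $bytes);
```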
45
46   Built-in Encodings
47       The following encodings are always available.
48
49         Canonical     Aliases                      Comments & References
50         ----------------------------------------------------------------
51         ascii         US-ascii ISO-646-US                         [ECMA]
52         ascii-ctrl                                      Special Encoding
53         iso-8859-1    latin1                                       [ISO]
54         null                                            Special Encoding
55         utf8          UTF-8                                    [RFC2279]
56         ----------------------------------------------------------------
57
       null and ascii-ctrl are special.  "null" fails for all characters, so
       when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
       CHARACTERS will fall back to character references.  Ditto for
       "ascii-ctrl" except for control characters.  For fallback modes, see
       Encode.
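
       The three fallback modes can be compared with a short sketch.  It
       uses "ascii" for the cases with known output and then "null", which
       per the description above should turn every character into a
       reference; no output is claimed for that last call.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{e9}";    # U+00E9 cannot be represented in ASCII

# Each FB_* constant includes LEAVE_SRC, so $text is not modified.
my $perlqq = encode("ascii", $text, Encode::FB_PERLQQ());    # caf\x{00e9}
my $html   = encode("ascii", $text, Encode::FB_HTMLCREF());  # caf&#233;
my $xml    = encode("ascii", $text, Encode::FB_XMLCREF());   # caf&#xe9;

# With "null", every character is expected to fall back.
my $all_refs = encode("null", "abc", Encode::FB_XMLCREF());

print "$_\n" for $perlqq, $html, $xml, $all_refs;
```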
62
63   Encode::Unicode -- other Unicode encodings
64       Unicode coding schemes other than native utf8 are supported by
65       Encode::Unicode, which will be autoloaded on demand.
66
67         ----------------------------------------------------------------
68         UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
69         UCS-2LE                                                     [UC]
70         UTF-16                                                      [UC]
71         UTF-16BE                                                    [UC]
72         UTF-16LE                                                    [UC]
73         UTF-32                                                      [UC]
74         UTF-32BE      UCS-4                                         [UC]
75         UTF-32LE                                                    [UC]
76         UTF-7                                                  [RFC2152]
77         ----------------------------------------------------------------
78
79       To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, see
80       Encode::Unicode.
81
       UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
       encoding.  It is implemented separately by Encode::Unicode::UTF7.
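
       To get a feel for how the byte order variants differ, the following
       sketch encodes a single "A" (U+0041) and shows the bytes in hex.  The
       hash name %hex is illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "A";    # U+0041

# Map each encoding name to the hex dump of the encoded character.
my %hex = map { $_ => unpack("H*", encode($_, $char)) }
    qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE);

# UTF-16BE => 0041      (big endian, no BOM)
# UTF-16LE => 4100      (little endian, no BOM)
# UTF-16   => feff0041  (BOM followed by big endian data)
# UTF-32BE => 00000041
printf "%-8s => %s\n", $_, $hex{$_} for sort keys %hex;
```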
84
85   Encode::Byte -- Extended ASCII
86       Encode::Byte implements most single-byte encodings except for Symbols
87       and EBCDIC. The following encodings are based on single-byte encodings
88       implemented as extended ASCII.  Most of them map \x80-\xff (upper half)
89       to non-ASCII characters.
90
91       ISO-8859 and corresponding vendor mappings
         Since there are so many, they are presented in table format with
         languages and corresponding encoding names by vendors.  Note that the
         table is sorted in order of ISO-8859 and the corresponding vendor
         mappings are slightly different from those of ISO.  See
         <http://czyborra.com/charsets/iso8859.html> for details.
97
           Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
           ----------------------------------------------------------------
           N. America    (ASCII)         cp437        AdobeStandardEncoding
                                         cp863 (DOSCanadaF)
           W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                                   hp-roman8
                                         cp860 (DOSPortuguese)
           Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                         MacCroatian
                                                         MacRomanian
                                                         MacRumanian
           Latin3[1]     iso-8859-3
           Latin4[2]     iso-8859-4
           Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
             (See also next section)     cp866           MacUkrainian
           Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                         cp1006          MacFarsi
           Greek         iso-8859-7      cp737   cp1253  MacGreek
                                         cp869 (DOSGreek2)
           Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
           Turkish       iso-8859-9      cp857   cp1254  MacTurkish
           Nordics       iso-8859-10     cp865
                                         cp861           MacIcelandic
                                                         MacSami
           Thai          iso-8859-11[3]  cp874           MacThai
           (iso-8859-12 is nonexistent. Reserved for Indics?)
           Baltics       iso-8859-13     cp775   cp1257
           Celtics       iso-8859-14
           Latin9 [4]    iso-8859-15
           Latin10       iso-8859-16
           Vietnamese    viscii                  cp1258  MacVietnamese
           ----------------------------------------------------------------

           [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
           [2] Baltics.  Now on 8859-10, except for Latvian.
           [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
           [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
               letters that are missing from 8859-1 were added.
136
137         All cp* are also available as ibm-*, ms-*, and windows-* .  See also
138         <http://czyborra.com/charsets/codepages.html>.
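
         The prefixed names and the ISO-versus-vendor differences can be
         checked with a short sketch like the one below.  It assumes the
         Encode::Byte encodings are installed, as they are in a standard
         Perl distribution.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The ibm- and windows- prefixes resolve to the same cp* encoding.
my @names = map { find_encoding($_)->name } qw(cp1252 windows-1252 ibm-850);
print "@names\n";    # cp1252 cp1252 cp850

# Vendor mappings diverge from ISO: byte 0x80 is a C1 control in
# iso-8859-1 but the Euro sign in cp1252; iso-8859-15 puts the Euro
# sign at 0xA4 instead.
my $from_cp1252 = decode("cp1252",      "\x80");    # U+20AC
my $from_latin1 = decode("iso-8859-1",  "\x80");    # U+0080
my $from_latin9 = decode("iso-8859-15", "\xA4");    # U+20AC

printf "U+%04X U+%04X U+%04X\n",
    ord $from_cp1252, ord $from_latin1, ord $from_latin9;
```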
139
140         Macintosh encodings don't seem to be registered in such entities as
141         IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
142         1150.  See <http://developer.apple.com/technotes/tn/tn1150.html> for
143         details.
144
145       KOI8 - De Facto Standard for the Cyrillic world
         Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
         popular on the Net.  Encode comes with the following KOI charsets.
         For gory details, see <http://czyborra.com/charsets/cyrillic.html>.
149
150           ----------------------------------------------------------------
151           koi8-f
152           koi8-r cp878                                           [RFC1489]
153           koi8-u                                                 [RFC2319]
154           ----------------------------------------------------------------
155
156   gsm0338 - Hentai Latin 1
157       GSM0338 is for GSM handsets. Though it shares alphanumerals with ASCII,
158       control character ranges and other parts are mapped very differently,
159       mainly to store Greek characters.  There are also escape sequences
160       (starting with 0x1B) to cover e.g. the Euro sign.
161
       This was once handled by Encode::Byte but because of all those unusual
       specifications, Encode 2.20 has relocated the support to
       Encode::GSM0338. See Encode::GSM0338 for details.
165
166       gsm0338 support before 2.19
167         Some special cases like a trailing 0x00 byte or a lone 0x1B byte are
168         not well-defined and decode() will return an empty string for them.
169         One possible workaround is
170
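            use Encode qw(decode);
            # Double a trailing 0x00 byte, which is otherwise not well-defined.
            $gsm =~ s/\x00\z/\x00\x00/;
            $uni = decode("gsm0338", $gsm);
            # A lone trailing 0x1B is not well-defined either; append U+00A0.
            $uni .= "\xA0" if $gsm =~ /\x1B\z/;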
174
         Note that the Encode implementation of GSM0338 does not implement the
         reuse of Latin capital letters as Greek capital letters (for example,
         0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
         LETTER ZETA)).
179
180         The GSM0338 is also covered in Encode::Byte even though it is not an
181         "extended ASCII" encoding.
182
183   CJK: Chinese, Japanese, Korean (Multibyte)
184       Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
185       below.  Also note that these are implemented in distinct modules by
186       countries, due to the size concerns (simplified Chinese is mapped to
187       'CN', continental China, while traditional Chinese is mapped to 'TW',
188       Taiwan).  Please refer to their respective documentation pages.
189
190       Encode::CN -- Continental China
191           Standard      DOS/Win Macintosh                Comment/Reference
192           ----------------------------------------------------------------
193           euc-cn [1]            MacChineseSimp
194           (gbk)         cp936 [2]
195           gb12345-raw                      { GB12345 without CES }
196           gb2312-raw                       { GB2312  without CES }
197           hz
198           iso-ir-165
199           ----------------------------------------------------------------
200
           [1] GB2312 is aliased to this.  See "Microsoft-related naming mess"
           [2] gbk is aliased to this.  See "Microsoft-related naming mess"
203
204       Encode::JP -- Japan
205           Standard      DOS/Win Macintosh                Comment/Reference
206           ----------------------------------------------------------------
207           euc-jp
208           shiftjis      cp932   macJapanese
209           7bit-jis
210           iso-2022-jp                                            [RFC1468]
211           iso-2022-jp-1                                          [RFC2237]
212           jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
213           jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
214           jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
215           ----------------------------------------------------------------
216
217       Encode::KR -- Korea
218           Standard      DOS/Win Macintosh                Comment/Reference
219           ----------------------------------------------------------------
220           euc-kr                MacKorean                        [RFC1557]
221                         cp949 [1]
222           iso-2022-kr                                            [RFC1557]
223           johab                                  [KS X 1001:1998, Annex 3]
224           ksc5601-raw                              { KSC5601 without CES }
225           ----------------------------------------------------------------
226
227           [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
228           See below.
229
230       Encode::TW -- Taiwan
231           Standard      DOS/Win Macintosh                Comment/Reference
232           ----------------------------------------------------------------
233           big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
234           big5-hkscs
235           ----------------------------------------------------------------
236
237       Encode::HanExtra -- More Chinese via CPAN
238         Due to the size concerns, additional Chinese encodings below are
239         distributed separately on CPAN, under the name Encode::HanExtra.
240
241           Standard      DOS/Win Macintosh                Comment/Reference
242           ----------------------------------------------------------------
243           big5ext                                   CMEX's Big5e Extension
244           big5plus                                  CMEX's Big5+ Extension
245           cccii         Chinese Character Code for Information Interchange
246           euc-tw                             EUC (Extended Unix Character)
247           gb18030                          GBK with Traditional Characters
248           ----------------------------------------------------------------
249
250       Encode::JIS2K -- JIS X 0213 encodings via CPAN
251         Due to size concerns, additional Japanese encodings below are
252         distributed separately on CPAN, under the name Encode::JIS2K.
253
           Standard      DOS/Win Macintosh                Comment/Reference
           ----------------------------------------------------------------
           euc-jisx0213
           shiftjisx0213
           iso-2022-jp-3
           jis0213-1-raw
           jis0213-2-raw
           ----------------------------------------------------------------
262
263   Miscellaneous encodings
264       Encode::EBCDIC
265         See perlebcdic for details.
266
267           ----------------------------------------------------------------
268           cp37
269           cp500
270           cp875
271           cp1026
272           cp1047
273           posix-bc
274           ----------------------------------------------------------------
275
       Encode::Symbol
         For symbols and dingbats.
278
279           ----------------------------------------------------------------
280           symbol
281           dingbats
282           MacDingbats
283           AdobeZdingbat
284           AdobeSymbol
285           ----------------------------------------------------------------
286
287       Encode::MIME::Header
         Strictly speaking, the MIME header encoding documented in RFC 2047
         is more of an encapsulation than an encoding.  However, supporting
         it is imperative in the modern world, so it is supported.
291
292           ----------------------------------------------------------------
293           MIME-Header                                            [RFC2047]
294           MIME-B                                                 [RFC2047]
295           MIME-Q                                                 [RFC2047]
296           ----------------------------------------------------------------
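
         A minimal sketch of decoding and encoding a header value with
         "MIME-Header".  The exact encoded form produced by encode() can vary
         between Encode versions, so only the round trip is relied on here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode an RFC 2047 encoded word into a Perl character string.
my $subject = decode("MIME-Header", "=?UTF-8?Q?caf=C3=A9?=");

# Encode a character string for use in a header, then decode it again.
my $header = encode("MIME-Header", "caf\x{e9}");
my $round  = decode("MIME-Header", $header);

print "$header\n";
print "round trip ok\n" if $round eq "caf\x{e9}";
```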
297
298       Encode::Guess
         This one is not the name of an encoding but a utility that lets you
         pick the most appropriate encoding for a piece of data out of given
         suspects.  See Encode::Guess for details.
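
         A small sketch of guessing with the default suspects (ascii and
         utf8).  guess_encoding() returns an encoding object on success and
         an error string otherwise, so the result has to be checked with
         ref().

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

my $octets = "caf\xc3\xa9";    # UTF-8 bytes for "caf" plus U+00E9

# No extra suspects are passed, so only the defaults are tried.
my $guess = guess_encoding($octets);

# An object means a single encoding matched; a string is an error.
my $text = ref $guess ? $guess->decode($octets) : undef;

if (defined $text) {
    printf "guessed %s, %d characters\n", $guess->name, length $text;
}
else {
    print "could not guess: $guess\n";
}
```

         For data that may be in one of several legacy encodings, pass the
         candidate names as additional arguments to guess_encoding().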
302

Unsupported encodings

304       The following encodings are not supported as yet; some because they are
305       rarely used, some because of technical difficulties.  They may be
306       supported by external modules via CPAN in the future, however.
307
308       ISO-2022-JP-2 [RFC1554]
         Not very popular yet.  Needs Unicode Database or equivalent to
         implement encode(), because it includes JIS X 0208/0212, KSC5601, and
         GB2312 simultaneously, whose code points in Unicode overlap.  So you
         need to look up the database to determine to which character set a
         given Unicode character should belong.
314
315       ISO-2022-CN [RFC1922]
316         Not very popular.  Needs CNS 11643-1 and -2 which are not available
317         in this module.  CNS 11643 is supported (via euc-tw) in
318         Encode::HanExtra.  Autrijus Tang may add support for this encoding in
319         his module in future.
320
321       Various HP-UX encodings
322         The following are unsupported due to the lack of mapping data.
323
324           '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
325           '15' - japanese15, korean15, and roi15
326
327       Cyrillic encoding ISO-IR-111
328         Anton Tagunov doubts its usefulness.
329
330       ISO-8859-8-1 [Hebrew]
         None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
         and MacHebrew are supported only because there were mappings
         available at <http://www.unicode.org/>).  Contributions welcome.
334
335       ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
336         Ditto.
337
338       Thai encoding TCVN
339         Ditto.
340
341       Vietnamese encodings VPS
342         Though Jungshik Shin has reported that Mozilla supports this
343         encoding, it was too late before 5.8.0 for us to add it.  In the
344         future, it may be available via a separate module.  See
345         <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
346         and
347         <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
348         if you are interested in helping us.
349
350       Various Mac encodings
351         The following are unsupported due to the lack of mapping data.
352
353           MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
354           MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
355           MacLaotian,   MacMalayalam, MacMongolian, MacOriya
356           MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
357           MacVietnamese
358
359         The rest which are already available are based upon the vendor
360         mappings at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
361
362       (Mac) Indic encodings
         The maps for the following are available at <http://www.unicode.org/>
         but remain unsupported because those encodings need an algorithmic
         approach, which enc2xs currently does not support:
366
367           MacDevanagari
368           MacGurmukhi
369           MacGujarati
370
371         For details, please see "Unicode mapping issues and notes:" at
372         <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
373
374         I believe this issue is prevalent not only for Mac Indics but also in
375         other Indic encodings, but the above were the only Indic encodings
376         maps that I could find at <http://www.unicode.org/> .
377

Encoding vs. Charset -- terminology

379       We are used to using the term (character) encoding and character set
380       interchangeably.  But just as confusing the terms byte and character is
381       dangerous and the terms should be differentiated when needed, we need
382       to differentiate encoding and character set.
383
384       To understand that, here is a description of how we make computers grok
385       our characters.
386
387       · First we start with which characters to include.  We call this
388         collection of characters character repertoire.
389
390       · Then we have to give each character a unique ID so your computer can
391         tell the difference between 'a' and 'A'.  This itemized character
392         repertoire is now a character set.
393
       · If your computer can grok the character set without further
         processing, you can go ahead and use it.  This is called a coded
         character set (CCS) or raw character encoding.  ASCII is used this
         way for most cases.
398
399       · But in many cases, especially multi-byte CJK encodings, you have to
400         tweak a little more.  Your network connection may not accept any data
401         with the Most Significant Bit set, and your computer may not be able
402         to tell if a given byte is a whole character or just half of it.  So
403         you have to encode the character set to use it.
404
405         A character encoding scheme (CES) determines how to encode a given
406         character set, or a set of multiple character sets.  7bit ISO-2022 is
407         an example of a CES.  You switch between character sets via escape
408         sequences.
409
410       Technically, or mathematically, speaking, a character set encoded in
411       such a CES that maps character by character may form a CCS.  EUC is
412       such an example.  The CES of EUC is as follows:
413
414       · Map ASCII unchanged.
415
416       · Map such a character set that consists of 94 or 96 powered by N
417         members by adding 0x80 to each byte.
418
419       · You can also use 0x8e and 0x8f to indicate that the following
420         sequence of characters belongs to yet another character set.  To each
421         following byte is added the value 0x80.
422
       By carefully looking at the encoded byte sequence, you can find that
       the byte sequence corresponds to a unique number.  In that sense, EUC
       is a CCS generated by the CES above from up to four CCSs
       (complicated?).  UTF-8 falls into this category.  See "UTF-8" in
       perlunicode to find out how UTF-8 maps Unicode to a byte sequence.
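
       The "add 0x80 to each byte" rule can be tried by hand.  The sketch
       below assumes the "jis0208-raw" encoding from Encode::JP yields the
       plain two-byte JIS X 0208 code, applies the rule, and compares the
       result with what "euc-jp" produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{65E5}";                      # a kanji in JIS X 0208
my $raw  = encode("jis0208-raw", $char);    # coded character set only

# Apply the EUC rule by hand: add 0x80 to each byte.
my $by_hand = join '', map { chr(ord($_) + 0x80) } split //, $raw;

my $euc = encode("euc-jp", $char);

printf "raw %s, by hand %s, euc-jp %s\n",
    map { unpack "H*", $_ } $raw, $by_hand, $euc;
```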
428
       You may also have found out by now why 7bit ISO-2022 cannot comprise a
       CCS.  If you look at a byte sequence \x21\x21, you can't tell if it is
       two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you
       have no trouble differentiating between "!!" and "  ".
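
       The ambiguity is easy to see by encoding IDEOGRAPHIC SPACE (U+3000)
       both ways.  This is a sketch using the Encode::JP encodings; the
       variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

my $euc  = encode("euc-jp",      $space);   # bytes a1 a1
my $jis  = encode("iso-2022-jp", $space);   # "!!" wrapped in escapes
my $bang = encode("euc-jp",      "!!");     # bytes 21 21

# In ISO-2022-JP the payload bytes are the same as "!!"; only the
# surrounding escape sequences say which character set is active.
printf "euc-jp: %s  iso-2022-jp: %s  '!!': %s\n",
    map { unpack "H*", $_ } $euc, $jis, $bang;
```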
433

Encoding Classification (by Anton Tagunov and Dan Kogai)

435       This section tries to classify the supported encodings by their
436       applicability for information exchange over the Internet and to choose
437       the most suitable aliases to name them in the context of such
438       communication.
439
440       · To (en|de)code encodings marked by "(**)", you need
441         "Encode::HanExtra", available from CPAN.
442
443       Encoding names
444
445         US-ASCII    UTF-8    ISO-8859-*  KOI8-R
446         Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
447         EUC-KR      Big5     GB2312
448
449       are registered with IANA as preferred MIME names and may be used over
450       the Internet.
451
452       "Shift_JIS" has been officialized by JIS X 0208:1997.  "Microsoft-
453       related naming mess" gives details.
454
455       "GB2312" is the IANA name for "EUC-CN".  See "Microsoft-related naming
456       mess" for details.
457
458       "GB_2312-80" raw encoding is available as "gb2312-raw" with Encode. See
459       Encode::CN for details.
460
461         EUC-CN
462         KOI8-U        [RFC2319]
463
464       have not been registered with IANA (as of March 2002) but seem to be
465       supported by major web browsers.  The IANA name for "EUC-CN" is
466       "GB2312".
467
468         KS_C_5601-1987
469
470       is heavily misused.  See "Microsoft-related naming mess" for details.
471
       "KS_C_5601-1987" raw encoding is available as "ksc5601-raw" with
       Encode. See Encode::KR for details.
474
475         UTF-16 UTF-16BE UTF-16LE
476
477       are IANA-registered "charset"s. See [RFC 2781] for details.  Jungshik
478       Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and
479       NS 4/6. Beware however that
480
       · "UTF-16" support in any software you're going to be
         using/interoperating with has probably been less tested than "UTF-8"
         support
484
485       · "UTF-8" coded data seamlessly passes traditional command piping
486         ("cat", "more", etc.) while "UTF-16" coded data is likely to cause
487         confusion (with its zero bytes, for example)
488
       · it is beyond the power of words to describe the way HTML browsers
         encode non-"ASCII" form data. To get a general impression, visit
         <http://www.alanflavell.org.uk/charset/form-i18n.html>.  While
         encoding of form data has stabilized for "UTF-8" encoded pages (at
         least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
         expect fun (and cross-browser discrepancies) with "UTF-16" encoded
         pages!
497
498       The rule of thumb is to use "UTF-8" unless you know what you're doing
499       and unless you really benefit from using "UTF-16".
500
         ISO-IR-165    [RFC1345]
         VISCII
         GB 12345
         GB 18030 (**)  (see links below)
         EUC-TW   (**)
506
507       are totally valid encodings but not registered at IANA.  The names
508       under which they are listed here are probably the most widely-known
509       names for these encodings and are recommended names.
510
511         BIG5PLUS (**)
512
513       is a proprietary name.
514
515   Microsoft-related naming mess
516       Microsoft products misuse the following names:
517
518       KS_C_5601-1987
519         Microsoft extension to "EUC-KR".
520
521         Proper names: "CP949", "UHC", "x-windows-949" (as used by Mozilla).
522
         See
         <http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
         for details.
527
         Encode aliases "KS_C_5601-1987" to "cp949" to reflect this common
         misuse. Raw "KS_C_5601-1987" encoding is available as
         "ksc5601-raw".
531
532         See Encode::KR for details.
533
534       GB2312
535         Microsoft extension to "EUC-CN".
536
537         Proper names: "CP936", "GBK".
538
539         "GB2312" has been registered in the "EUC-CN" meaning at IANA. This
540         has partially repaired the situation: Microsoft's "GB2312" has become
541         a superset of the official "GB2312".
542
543         Encode aliases "GB2312" to "euc-cn" in full agreement with IANA
544         registration. "cp936" is supported separately.  Raw "GB_2312-80"
545         encoding is available as "gb2312-raw".
546
547         See Encode::CN for details.
548
549       Big5
550         Microsoft extension to "Big5".
551
552         Proper name: "CP950".
553
554         Encode separately supports "Big5" and "cp950".
555
556       Shift_JIS
557         Microsoft's understanding of "Shift_JIS".
558
559         JIS has not endorsed the full Microsoft standard however.  The
560         official "Shift_JIS" includes only JIS X 0201 and JIS X 0208
561         character sets, while Microsoft has always used "Shift_JIS" to encode
562         a wider character repertoire. See "IANA" registration for
563         "Windows-31J".
564
565         As a historical predecessor, Microsoft's variant probably has more
566         rights for the name, though it may be objected that Microsoft
567         shouldn't have used JIS as part of the name in the first place.
568
569         Unambiguous name: "CP932". "IANA" name (also used by Mozilla, and
570         provided as an alias by Encode): "Windows-31J".
571
572         Encode separately supports "Shift_JIS" and "cp932".
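
         How Encode resolves the names discussed in this section can be
         checked with find_encoding().  This is a sketch that assumes the
         CJK modules bundled with Encode are installed.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each commonly seen name to the canonical name Encode uses for it.
my @names = ("KS_C_5601-1987", "GB2312", "GBK",
             "Big5", "Shift_JIS", "Windows-31J");
my %canon = map { $_ => find_encoding($_)->name } @names;

# KS_C_5601-1987 => cp949      GB2312      => euc-cn
# GBK            => cp936      Big5        => big5-eten
# Shift_JIS      => shiftjis   Windows-31J => cp932
printf "%-15s => %s\n", $_, $canon{$_} for @names;
```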
573

Glossary

575       character repertoire
576         A collection of unique characters.  A character set in the strictest
577         sense. At this stage, characters are not numbered.
578
579       coded character set (CCS)
580         A character set that is mapped in a way computers can use directly.
581         Many character encodings, including EUC, fall in this category.
582
583       character encoding scheme (CES)
         An algorithm to map a character set to a byte sequence.  You don't
         have to be able to tell which character set a given byte sequence
         belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is
         an example of being both a CCS and CES.
588
589       charset (in MIME context)
590         has long been used in the meaning of "encoding", CES.
591
592         While the word combination "character set" has lost this meaning in
593         MIME context since [RFC 2130], the "charset" abbreviation has
594         retained it. This is how [RFC 2277] and [RFC 2278] bless "charset":
595
596          This document uses the term "charset" to mean a set of rules for
597          mapping from a sequence of octets to a sequence of characters, such
598          as the combination of a coded character set and a character encoding
599          scheme; this is also what is used as an identifier in MIME "charset="
600          parameters, and registered in the IANA charset registry ...  (Note
601          that this is NOT a term used by other standards bodies, such as ISO).
602          [RFC 2277]
603
604       EUC
605         Extended Unix Character.  See ISO-2022.
606
607       ISO-2022
608         A CES that was carefully designed to coexist with ASCII.  There are a
609         7 bit version and an 8 bit version.
610
611         The 7 bit version switches character set via escape sequence so it
612         cannot form a CCS.  Since this is more difficult to handle in
613         programs than the 8 bit version, the 7 bit version is not very
614         popular except for iso-2022-jp, the de facto standard CES for
615         e-mails.
616
617         The 8 bit version can form a CCS.  EUC and ISO-8859 are two examples
618         thereof.  Pre-5.6 perl could use them as string literals.
619
620       UCS
621         Short for Universal Character Set.  When you say just UCS, it means
622         Unicode.
623
624       UCS-2
625         ISO/IEC 10646 encoding form: Universal Character Set coded in two
626         octets.
627
628       Unicode
629         A character set that aims to include all character repertoires of the
630         world.  Many character sets in various national as well as industrial
631         standards have become, in a way, just subsets of Unicode.
632
633       UTF
634         Short for Unicode Transformation Format.  Determines how to map a
635         Unicode character into a byte sequence.
636
637       UTF-16
638         A UTF in 16-bit encoding.  Can either be in big endian or little
639         endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
640         surrogate support) and the little endian version is called UTF-16LE.
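
         The surrogate mechanism can be worked through by hand for a
         character outside the 16-bit range and compared with Encode's
         output.  This is a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $cp = 0x1F600;    # a code point above U+FFFF

# Split the 20-bit offset into a high and a low surrogate.
my $offset = $cp - 0x10000;
my $high   = 0xD800 + ($offset >> 10);      # 0xD83D
my $low    = 0xDC00 + ($offset & 0x3FF);    # 0xDE00

# Pack both 16-bit units in big endian order.
my $by_hand = pack("n2", $high, $low);

my $utf16be = encode("UTF-16BE", chr($cp));

printf "by hand %s, Encode %s\n",
    unpack("H*", $by_hand), unpack("H*", $utf16be);
```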
641

See Also

       Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW,
       Encode::EBCDIC, Encode::Symbol, Encode::MIME::Header, Encode::Guess
645

References

647       ECMA
648         European Computer Manufacturers Association <http://www.ecma.ch>
649
         ECMA-035 (eq "ISO-2022")
           <http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
653
654           The specification of ISO-2022 is available from the link above.
655
656       IANA
657         Internet Assigned Numbers Authority <http://www.iana.org/>
658
         Assigned Charset Names by IANA
           <http://www.iana.org/assignments/character-sets>

           Most of the "canonical names" in Encode derive from this list, so
           you can directly apply the string you have extracted from the MIME
           headers of mails and web pages.
666
667       ISO
668         International Organization for Standardization <http://www.iso.ch/>
669
670       RFC
         Request For Comments -- need I say more?
         <http://www.rfc-editor.org/>, <http://www.ietf.org/rfc.html>,
         <http://www.faqs.org/rfcs/>
674
675       UC
676         Unicode Consortium <http://www.unicode.org/>
677
678         Unicode Glossary
679           <http://www.unicode.org/glossary/>
680
681           The glossary of this document is based upon this site.
682
683   Other Notable Sites
684       czyborra.com
685         <http://czyborra.com/>
686
687         Contains a lot of useful information, especially gory details of ISO
688         vs. vendor mappings.
689
690       CJK.inf
691         <http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>
692
693         Somewhat obsolete (last update in 1996), but still useful.  Also try
694
695         <ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
696
697         You will find brief info on "EUC-CN", "GBK" and mostly on "GB 18030".
698
699       Jungshik Shin's Hangul FAQ
700         <http://jshin.net/faq>
701
702         And especially its subject 8.
703
704         <http://jshin.net/faq/qa8.html>
705
706         A comprehensive overview of the Korean ("KS *") standards.
707
708       debian.org: "Introduction to i18n"
         A brief description for most of the mentioned CJK encodings is
         contained in
         <http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
713
714   Offline sources
715       "CJKV Information Processing" by Ken Lunde
         CJKV Information Processing, 1999, O'Reilly & Associates,
         ISBN 1-56592-224-7
718
719         The modern successor of "CJK.inf".
720
721         Features a comprehensive coverage of CJKV character sets and
722         encodings along with many other issues faced by anyone trying to
723         better support CJKV languages/scripts in all the areas of information
724         processing.
725
726         To purchase this book, visit
727         <http://oreilly.com/catalog/9780596514471/> or your favourite
728         bookstore.
729

perl v5.12.4                      2011-06-01            Encode::Supported(3pm)