Encode::Supported(3pm) Perl Programmers Reference Guide Encode::Supported(3pm)


NAME

6       Encode::Supported -- Encodings supported by Encode
7

DESCRIPTION

9       Encoding Names
10
11       Encoding names are case insensitive. White space in names is ignored.
12       In addition, an encoding may have aliases.  Each encoding has one
13       "canonical" name.  The "canonical" name is chosen from the names of the
14       encoding by picking the first in the following sequence (with a few
15       exceptions).
16
       ·   The name used by the Perl community.  That includes 'utf8' and
           'ascii'.  Unlike aliases, canonical names reach the encoding
           method directly, so frequently used names such as 'utf8' need
           no alias lookup.
21
22       ·   The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
23
24       ·   The name in the IANA registry.
25
26       ·   The name used by the organization that defined it.
27
       Where a de jure canonical name differs from the name Encode uses, the
       de jure name is always provided as an alias whenever the encoding is
       implemented.  So you can reliably tell whether a given encoding is
       implemented just by passing its canonical name.
32
33       Because of all the alias issues, and because in the general case encod‐
34       ings have state, "Encode" uses an encoding object internally once an
35       operation is in progress.
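
       As a minimal sketch of the name handling described above, the
       following uses find_encoding() and resolve_alias() from Encode.  The
       helper canonical_name() and the sample names are illustrative only
       and are not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding resolve_alias);

# Hypothetical helper: map a user-supplied name to Encode's canonical
# name, or return undef when the encoding is not implemented.
sub canonical_name {
    my ($name) = @_;
    my $enc = find_encoding($name);    # encoding object, or undef
    return $enc ? $enc->name : undef;
}

for my $name ('latin1', 'Latin1', 'US-ASCII', 'no-such-encoding') {
    my $canon = canonical_name($name);
    printf "%-18s => %s\n", $name,
        defined $canon ? $canon : '(not supported)';
}

# resolve_alias() returns a false value for an unimplemented encoding.
print "latin1 is implemented\n" if resolve_alias('latin1');
```

       The object returned by find_encoding() is the encoding object
       mentioned above; its encode() and decode() methods can be called
       directly.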
36

Supported Encodings

38       As of Perl 5.8.0, at least the following encodings are recognized.
       Note that unless otherwise specified, they are all case insensitive
       (via alias) and all occurrences of spaces are replaced with '-'.  In
       other words, "ISO 8859 1" and "iso-8859-1" are identical.
42
       Encodings are categorized and implemented in several different
       modules, but in most cases you don't have to "use Encode::XX" to make
       them available.  Encode.pm will automatically load those modules on
       demand.
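
       The on-demand loading can be observed with a short sketch like the
       one below.  It assumes a stock Encode installation; the exact lists
       printed depend on the Encode version in use.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encodings whose modules are loaded right now.
my @loaded = Encode->encodings();

# Every encoding Encode knows how to load on demand.
my @all = Encode->encodings(':all');

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;

# No "use Encode::JP" is needed: asking for euc-jp by name is enough.
my $bytes = encode('euc-jp', "\x{3042}");    # HIRAGANA LETTER A
printf "euc-jp bytes: %s\n", join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
```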
46
47       Built-in Encodings
48
49       The following encodings are always available.
50
51         Canonical     Aliases                      Comments & References
52         ----------------------------------------------------------------
53         ascii         US-ascii ISO-646-US                         [ECMA]
54         ascii-ctrl                                      Special Encoding
55         iso-8859-1    latin1                                       [ISO]
56         null                                            Special Encoding
57         utf8          UTF-8                                    [RFC2279]
58         ----------------------------------------------------------------
59
       null and ascii-ctrl are special.  "null" fails for all characters, so
       when you set the fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
       CHARACTERS will fall back to character references.  Ditto for
       "ascii-ctrl", except that control characters are passed through.  For
       fallback modes, see Encode.
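
       A hedged sketch of the "null" trick follows.  The sample string is
       arbitrary, and the exact escape formatting is whatever the installed
       Encode produces for each fallback mode.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

my $text = "A<B";

# encode() may modify its source argument when a CHECK value is given,
# so each call works on its own copy.
my $html_src = $text;
my $html     = encode('null', $html_src, FB_HTMLCREF);

my $perl_src = $text;
my $perlqq   = encode('null', $perl_src, FB_PERLQQ);

print "HTMLCREF: $html\n";
print "PERLQQ:   $perlqq\n";
```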
64
65       Encode::Unicode -- other Unicode encodings
66
67       Unicode coding schemes other than native utf8 are supported by
68       Encode::Unicode, which will be autoloaded on demand.
69
70         ----------------------------------------------------------------
71         UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
72         UCS-2LE                                                     [UC]
73         UTF-16                                                      [UC]
74         UTF-16BE                                                    [UC]
75         UTF-16LE                                                    [UC]
76         UTF-32                                                      [UC]
77         UTF-32BE      UCS-4                                         [UC]
78         UTF-32LE                                                    [UC]
79         UTF-7                                                  [RFC2152]
80         ----------------------------------------------------------------
81
82       To find how (UCS-2⎪UTF-(16⎪32))(LE⎪BE)? differ from one another, see
83       Encode::Unicode.
84
       UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
       encoding.  It is implemented separately by Encode::Unicode::UTF7.
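
       To illustrate how the byte order variants differ, here is a small
       sketch that encodes the single character "A" in several of the
       schemes listed above.  The helper hex_bytes() is made up for this
       example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

my %hex;
for my $name (qw(UTF-16BE UTF-16LE UTF-16 UTF-32BE UTF-32LE)) {
    $hex{$name} = hex_bytes(encode($name, 'A'));
    printf "%-9s %s\n", $name, $hex{$name};
}
```

       Note that plain "UTF-16" (no BE/LE suffix) prepends a byte order
       mark when encoding.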
87
88       Encode::Byte -- Extended ASCII
89
90       Encode::Byte implements most single-byte encodings except for Symbols
91       and EBCDIC. The following encodings are based on single-byte encodings
92       implemented as extended ASCII.  Most of them map \x80-\xff (upper half)
93       to non-ASCII characters.
94
95       ISO-8859 and corresponding vendor mappings
           Since there are so many, they are presented in table format with
           languages and corresponding encoding names by vendors.  Note that
           the table is sorted in order of ISO-8859 and that the corresponding
           vendor mappings differ slightly from those of ISO.  See
           <http://czyborra.com/charsets/iso8859.html> for details.
101
102             Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
103             ----------------------------------------------------------------
104             N. America    (ASCII)         cp437        AdobeStandardEncoding
105                                           cp863 (DOSCanadaF)
106             W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
107                                                                    hp-roman8
108                                           cp860 (DOSPortuguese)
109             Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
110                                                           MacCroatian
111                                                           MacRomanian
112                                                           MacRumanian
113             Latin3[1]     iso-8859-3
114             Latin4[2]     iso-8859-4
115             Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
116               (See also next section)     cp866           MacUkrainian
117             Arabic        iso-8859-6      cp864   cp1256  MacArabic
118                                           cp1006          MacFarsi
119             Greek         iso-8859-7      cp737   cp1253  MacGreek
120                                           cp869 (DOSGreek2)
121             Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
122             Turkish       iso-8859-9      cp857   cp1254  MacTurkish
123             Nordics       iso-8859-10     cp865
124                                           cp861           MacIcelandic
125                                                           MacSami
126             Thai          iso-8859-11[3]  cp874           MacThai
127             (iso-8859-12 is nonexistent. Reserved for Indics?)
             Baltics       iso-8859-13     cp775   cp1257
129             Celtics       iso-8859-14
130             Latin9 [4]    iso-8859-15
131             Latin10       iso-8859-16
132             Vietnamese    viscii                  cp1258  MacVietnamese
133             ----------------------------------------------------------------
134
135             [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
136             [2] Baltics.  Now on 8859-10, except for Latvian.
137             [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
138             [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
139                 letters that are missing from 8859-1 were added.
140
141           All cp* are also available as ibm-*, ms-*, and windows-* .  See
142           also <http://czyborra.com/charsets/codepages.html>.
143
144           Macintosh encodings don't seem to be registered in such entities as
145           IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
146           1150.  See <http://developer.apple.com/technotes/tn/tn1150.html>
147           for details.
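
           The "slightly different" vendor mappings can be seen with the
           Euro sign.  The sketch below is only an illustration; the
           variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# cp1252 places the Euro sign at 0x80, which ISO-8859-1 treats as a
# C1 control character.
my $from_cp1252 = decode('cp1252',     "\x80");
my $from_latin1 = decode('iso-8859-1', "\x80");

# ISO-8859-15 (Latin9) puts the Euro sign at 0xA4 instead.
my $latin9_euro = encode('iso-8859-15', "\x{20AC}");

# All cp* names are also reachable through their vendor-style aliases.
my $canon = find_encoding('windows-1252')->name;

printf "cp1252 0x80 -> U+%04X\n", ord $from_cp1252;
printf "latin1 0x80 -> U+%04X\n", ord $from_latin1;
printf "Euro in iso-8859-15 -> 0x%02X\n", ord $latin9_euro;
print  "windows-1252 is $canon\n";
```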
148
149       KOI8 - De Facto Standard for the Cyrillic world
           Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
           popular on the Net.  Encode comes with the following KOI charsets.
           For gory details, see <http://czyborra.com/charsets/cyrillic.html>.
153
154             ----------------------------------------------------------------
155             koi8-f
156             koi8-r cp878                                           [RFC1489]
157             koi8-u                                                 [RFC2319]
158             ----------------------------------------------------------------
159
160       gsm0338 - Hentai Latin 1
161           GSM0338 is for GSM handsets. Though it shares alphanumerals with
162           ASCII, control character ranges and other parts are mapped very
163           differently, mainly to store Greek characters.  There are also
164           escape sequences (starting with 0x1B) to cover e.g. the Euro sign.
165           Some special cases like a trailing 0x00 byte or a lone 0x1B byte
166           are not well-defined and decode() will return an empty string for
167           them.  One possible workaround is
168
169              $gsm =~ s/\x00\z/\x00\x00/;
170              $uni = decode("gsm0338", $gsm);
171              $uni .= "\xA0" if $gsm =~ /\x1B\z/;
172
           Note that the Encode implementation of GSM0338 does not implement
           the reuse of Latin capital letters as Greek capital letters (for
           example, 0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396
           (GREEK CAPITAL LETTER ZETA)).
177
178           The GSM0338 is also covered in Encode::Byte even though it is not
179           an "extended ASCII" encoding.
180
181       CJK: Chinese, Japanese, Korean (Multibyte)
182
       Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
       below.  Also note that these are implemented in distinct modules by
       country, due to size concerns (simplified Chinese is mapped to
       'CN', continental China, while traditional Chinese is mapped to 'TW',
       Taiwan).  Please refer to their respective documentation pages.
188
189       Encode::CN -- Continental China
190             Standard      DOS/Win Macintosh                Comment/Reference
191             ----------------------------------------------------------------
192             euc-cn [1]            MacChineseSimp
193             (gbk)         cp936 [2]
194             gb12345-raw                      { GB12345 without CES }
195             gb2312-raw                       { GB2312  without CES }
196             hz
197             iso-ir-165
198             ----------------------------------------------------------------
199
             [1] GB2312 is aliased to this.  See "Microsoft-related naming mess".
             [2] gbk is aliased to this.  See "Microsoft-related naming mess".
202
203       Encode::JP -- Japan
204             Standard      DOS/Win Macintosh                Comment/Reference
205             ----------------------------------------------------------------
206             euc-jp
207             shiftjis      cp932   macJapanese
208             7bit-jis
209             iso-2022-jp                                            [RFC1468]
210             iso-2022-jp-1                                          [RFC2237]
211             jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
212             jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
213             jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
214             ----------------------------------------------------------------
215
216       Encode::KR -- Korea
217             Standard      DOS/Win Macintosh                Comment/Reference
218             ----------------------------------------------------------------
219             euc-kr                MacKorean                        [RFC1557]
220                           cp949 [1]
221             iso-2022-kr                                            [RFC1557]
222             johab                                  [KS X 1001:1998, Annex 3]
223             ksc5601-raw                              { KSC5601 without CES }
224             ----------------------------------------------------------------
225
226             [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
227             See below.
228
229       Encode::TW -- Taiwan
230             Standard      DOS/Win Macintosh                Comment/Reference
231             ----------------------------------------------------------------
232             big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
233             big5-hkscs
234             ----------------------------------------------------------------
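
       As a rough illustration of the CJK tables above, the following
       sketch encodes one character per module and prints the resulting
       bytes.  The sample characters and the %hex hash are choices made
       for this example only.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# [ encoding name, sample character ]
my @samples = (
    [ 'euc-jp',    "\x{3042}" ],    # HIRAGANA LETTER A
    [ 'shiftjis',  "\x{3042}" ],
    [ 'euc-cn',    "\x{4E2D}" ],    # CJK ideograph "middle"
    [ 'big5-eten', "\x{4E2D}" ],
    [ 'euc-kr',    "\x{D55C}" ],    # HANGUL SYLLABLE HAN
);

my %hex;
for my $sample (@samples) {
    my ($name, $char) = @$sample;
    $hex{$name} = join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode($name, $char);
    printf "%-10s U+%04X -> %s\n", $name, ord $char, $hex{$name};
}

# "big5" is only an alias; the canonical name is big5-eten.
my $big5 = find_encoding('big5')->name;
print "big5 resolves to $big5\n";
```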
235
236       Encode::HanExtra -- More Chinese via CPAN
           Due to size concerns, the additional Chinese encodings below are
           distributed separately on CPAN, under the name Encode::HanExtra.
239
240             Standard      DOS/Win Macintosh                Comment/Reference
241             ----------------------------------------------------------------
242             big5ext                                   CMEX's Big5e Extension
243             big5plus                                  CMEX's Big5+ Extension
244             cccii         Chinese Character Code for Information Interchange
             euc-tw                                  EUC (Extended Unix Code)
246             gb18030                          GBK with Traditional Characters
247             ----------------------------------------------------------------
248
249       Encode::JIS2K -- JIS X 0213 encodings via CPAN
250           Due to size concerns, additional Japanese encodings below are dis‐
251           tributed separately on CPAN, under the name Encode::JIS2K.
252
253             Standard      DOS/Win Macintosh                Comment/Reference
254             ----------------------------------------------------------------
255             euc-jisx0213
             shiftjisx0213
257             iso-2022-jp-3
258             jis0213-1-raw
259             jis0213-2-raw
260             ----------------------------------------------------------------
261
262       Miscellaneous encodings
263
264       Encode::EBCDIC
265           See perlebcdic for details.
266
267             ----------------------------------------------------------------
268             cp37
269             cp500
270             cp875
271             cp1026
272             cp1047
273             posix-bc
274             ----------------------------------------------------------------
275
       Encode::Symbol
           For symbols and dingbats.
278
279             ----------------------------------------------------------------
280             symbol
281             dingbats
282             MacDingbats
283             AdobeZdingbat
284             AdobeSymbol
285             ----------------------------------------------------------------
286
287       Encode::MIME::Header
           Strictly speaking, MIME header encoding as documented in RFC 2047 is
           more of an encapsulation than an encoding.  However, supporting it
           is imperative in the modern world, so it is supported.
291
292             ----------------------------------------------------------------
293             MIME-Header                                            [RFC2047]
294             MIME-B                                                 [RFC2047]
295             MIME-Q                                                 [RFC2047]
296             ----------------------------------------------------------------
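
           A minimal round-trip sketch follows.  The subject text is an
           arbitrary sample, and the exact encoded-word layout produced by
           encode() may vary between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $subject = "\x{3042}";    # HIRAGANA LETTER A

# Wrap the text in an RFC 2047 encoded-word and unwrap it again.
my $header = encode('MIME-Header', $subject);
my $back   = decode('MIME-Header', $header);

# Decoding a hand-written encoded-word (Base64 of the UTF-8 bytes).
my $fixed = decode('MIME-Header', '=?UTF-8?B?44GC?=');

print "encoded header: $header\n";
print "round trip ok\n" if $back eq $subject;
```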
297
298       Encode::Guess
           This one is not the name of an encoding but a utility that lets you
           pick the most appropriate encoding for a piece of data out of given
           suspects.  See Encode::Guess for details.
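
           A small sketch of guessing, assuming the data is EUC-JP and that
           euc-jp is the only extra suspect.  With several overlapping
           suspects (for example euc-jp and shiftjis together) the guess
           can be ambiguous, in which case guess_encoding() returns an
           error string instead of an object.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# Sample data: two hiragana letters encoded as EUC-JP.
my $octets = encode('euc-jp', "\x{3042}\x{3044}");

# ascii and utf8 are always tried; euc-jp is added as a suspect.
my $guess = guess_encoding($octets, qw(euc-jp));

# On success an encoding object is returned, otherwise an error string.
my $name = ref $guess ? $guess->name : "guess failed: $guess";
print "guessed: $name\n";
```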
302

Unsupported encodings

304       The following encodings are not supported as yet; some because they are
305       rarely used, some because of technical difficulties.  They may be sup‐
306       ported by external modules via CPAN in the future, however.
307
308       ISO-2022-JP-2 [RFC1554]
           Not very popular yet.  Needs Unicode Database or equivalent to
           implement encode() (because it includes JIS X 0208/0212, KSC5601,
           and GB2312 simultaneously, whose code points in Unicode overlap,
           so you need to look up the database to determine to which
           character set a given Unicode character should belong).
314
315       ISO-2022-CN [RFC1922]
316           Not very popular.  Needs CNS 11643-1 and -2 which are not available
317           in this module.  CNS 11643 is supported (via euc-tw) in
318           Encode::HanExtra.  Autrijus Tang may add support for this encoding
319           in his module in future.
320
321       Various HP-UX encodings
322           The following are unsupported due to the lack of mapping data.
323
324             '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
325             '15' - japanese15, korean15, and roi15
326
327       Cyrillic encoding ISO-IR-111
328           Anton Tagunov doubts its usefulness.
329
       ISO-8859-8-I [Hebrew]
           None of the Encode team knows Hebrew well enough (ISO-8859-8, cp1255
           and MacHebrew are supported only because mappings were available
           at <http://www.unicode.org/>).  Contributions welcome.
335
336       ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
337           Ditto.
338
339       Thai encoding TCVN
340           Ditto.
341
342       Vietnamese encodings VPS
           Though Jungshik Shin has reported that Mozilla supports this
           encoding, it was too late before 5.8.0 for us to add it.  In the
           future, it may be available via a separate module.  See
           <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
           and
           <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
           if you are interested in helping us.
351
352       Various Mac encodings
353           The following are unsupported due to the lack of mapping data.
354
355             MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
356             MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
357             MacLaotian,   MacMalayalam, MacMongolian, MacOriya
358             MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
359             MacVietnamese
360
361           The rest which are already available are based upon the vendor map‐
362           pings at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
363
364       (Mac) Indic encodings
           The maps for the following are available at
           <http://www.unicode.org/> but remain unsupported because those
           encodings need an algorithmic approach, currently unsupported by
           enc2xs:
368
369             MacDevanagari
370             MacGurmukhi
371             MacGujarati
372
373           For details, please see "Unicode mapping issues and notes:" at
374           <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT>
375           .
376
           I believe this issue is prevalent not only for Mac Indics but also
           in other Indic encodings, but the above were the only Indic
           encoding maps that I could find at <http://www.unicode.org/>.
380

Encoding vs. Charset -- terminology

382       We are used to using the term (character) encoding and character set
383       interchangeably.  But just as confusing the terms byte and character is
384       dangerous and the terms should be differentiated when needed, we need
385       to differentiate encoding and character set.
386
387       To understand that, here is a description of how we make computers grok
388       our characters.
389
390       ·   First we start with which characters to include.  We call this col‐
391           lection of characters character repertoire.
392
393       ·   Then we have to give each character a unique ID so your computer
394           can tell the difference between 'a' and 'A'.  This itemized charac‐
395           ter repertoire is now a character set.
396
       ·   If your computer can grok the character set without further
           processing, you can go ahead and use it.  This is called a coded
           character set (CCS) or raw character encoding.  ASCII is used this
           way for most cases.
401
402       ·   But in many cases, especially multi-byte CJK encodings, you have to
403           tweak a little more.  Your network connection may not accept any
404           data with the Most Significant Bit set, and your computer may not
405           be able to tell if a given byte is a whole character or just half
406           of it.  So you have to encode the character set to use it.
407
408           A character encoding scheme (CES) determines how to encode a given
409           character set, or a set of multiple character sets.  7bit ISO-2022
410           is an example of a CES.  You switch between character sets via
411           escape sequences.
412
413       Technically, or mathematically, speaking, a character set encoded in
414       such a CES that maps character by character may form a CCS.  EUC is
415       such an example.  The CES of EUC is as follows:
416
417       ·   Map ASCII unchanged.
418
419       ·   Map such a character set that consists of 94 or 96 powered by N
420           members by adding 0x80 to each byte.
421
422       ·   You can also use 0x8e and 0x8f to indicate that the following
423           sequence of characters belongs to yet another character set.  To
424           each following byte is added the value 0x80.
425
       By carefully looking at the encoded byte sequence, you can find that
       the byte sequence forms a unique number.  In that sense, EUC is a
       CCS generated by the CES above from up to four CCSs (complicated?).
       UTF-8 falls into this category.  See "UTF-8" in perlunicode to find
       out how UTF-8 maps Unicode to a byte sequence.
431
       You may also have found out by now why 7bit ISO-2022 cannot comprise a
       CCS.  If you look at a byte sequence \x21\x21, you can't tell if it is
       two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you
       have no trouble differentiating between "!!" and IDEOGRAPHIC SPACE.
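
       The IDEOGRAPHIC SPACE example can be reproduced with a sketch like
       the following, which encodes the same character with the raw CCS
       (jis0208-raw), with EUC, and with 7bit ISO-2022.  Variable names
       are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ideographic_space = "\x{3000}";

# Raw JIS X 0208 code, without any CES applied.
my $raw = encode('jis0208-raw', $ideographic_space);

# EUC: the same two bytes, each with 0x80 added.
my $euc          = encode('euc-jp', $ideographic_space);
my $euc_from_raw = join '', map { chr(ord($_) + 0x80) } split //, $raw;

# 7bit ISO-2022: the raw bytes wrapped in escape sequences that switch
# to JIS X 0208 and back to ASCII.
my $iso2022 = encode('iso-2022-jp', $ideographic_space);

# Two ASCII exclamation marks need no escape sequences at all.
my $bangs = encode('iso-2022-jp', '!!');

for ([ raw => $raw ], [ euc => $euc ], [ iso2022 => $iso2022 ], [ bangs => $bangs ]) {
    printf "%-8s %s\n", $_->[0], join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_->[1];
}
```

       Without the surrounding escape sequences, the payload bytes of the
       ISO-2022 form are identical to "!!", which is exactly the
       ambiguity described above.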
436

Encoding Classification (by Anton Tagunov and Dan Kogai)

438       This section tries to classify the supported encodings by their appli‐
439       cability for information exchange over the Internet and to choose the
440       most suitable aliases to name them in the context of such communica‐
441       tion.
442
443       ·   To (en⎪de)code encodings marked by "(**)", you need "Encode::HanEx‐
444           tra", available from CPAN.
445
446       Encoding names
447
448         US-ASCII    UTF-8    ISO-8859-*  KOI8-R
449         Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
450         EUC-KR      Big5     GB2312
451
452       are registered with IANA as preferred MIME names and may be used over
453       the Internet.
454
455       "Shift_JIS" has been officialized by JIS X 0208:1997.  "Micro‐
456       soft-related naming mess" gives details.
457
458       "GB2312" is the IANA name for "EUC-CN".  See "Microsoft-related naming
459       mess" for details.
460
461       "GB_2312-80" raw encoding is available as "gb2312-raw" with Encode. See
462       Encode::CN for details.
463
464         EUC-CN
465         KOI8-U        [RFC2319]
466
467       have not been registered with IANA (as of March 2002) but seem to be
468       supported by major web browsers.  The IANA name for "EUC-CN" is
469       "GB2312".
470
471         KS_C_5601-1987
472
473       is heavily misused.  See "Microsoft-related naming mess" for details.
474
       "KS_C_5601-1987" raw encoding is available as "ksc5601-raw" with
       Encode. See Encode::KR for details.
477
478         UTF-16 UTF-16BE UTF-16LE
479
480       are IANA-registered "charset"s. See [RFC 2781] for details.  Jungshik
481       Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and
482       NS 4/6. Beware however that
483
       ·   "UTF-16" support in any software you're going to be using or
           interoperating with has probably been less tested than "UTF-8"
           support
487       ·   "UTF-8" coded data seamlessly passes traditional command piping
488           ("cat", "more", etc.) while "UTF-16" coded data is likely to cause
489           confusion (with its zero bytes, for example)
490
491       ·   it is beyond the power of words to describe the way HTML browsers
492           encode non-"ASCII" form data. To get a general impression, visit
493           <http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
494           While encoding of form data has stabilized for "UTF-8" encoded
495           pages (at least IE 5/6, NS 6, and Opera 6 behave consistently), be
496           sure to expect fun (and cross-browser discrepancies) with "UTF-16"
497           encoded pages!
498
499       The rule of thumb is to use "UTF-8" unless you know what you're doing
500       and unless you really benefit from using "UTF-16".
501
502         ISO-IR-165    [RFC1345]
503         VISCII
504         GB 12345
         GB 18030 (**)  (see links below)
506         EUC-TW   (**)
507
508       are totally valid encodings but not registered at IANA.  The names
509       under which they are listed here are probably the most widely-known
510       names for these encodings and are recommended names.
511
512         BIG5PLUS (**)
513
514       is a proprietary name.
515
516       Microsoft-related naming mess
517
518       Microsoft products misuse the following names:
519
520       KS_C_5601-1987
521           Microsoft extension to "EUC-KR".
522
523           Proper names: "CP949", "UHC", "x-windows-949" (as used by Mozilla).
524
           See
           <http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
           for details.
527
           Encode aliases "KS_C_5601-1987" to "cp949" to reflect this common
           misuse.  Raw "KS_C_5601-1987" encoding is available as
           "ksc5601-raw".
531
532           See Encode::KR for details.
533
534       GB2312
535           Microsoft extension to "EUC-CN".
536
537           Proper names: "CP936", "GBK".
538
539           "GB2312" has been registered in the "EUC-CN" meaning at IANA. This
540           has partially repaired the situation: Microsoft's "GB2312" has
541           become a superset of the official "GB2312".
542
543           Encode aliases "GB2312" to "euc-cn" in full agreement with IANA
544           registration. "cp936" is supported separately.  Raw "GB_2312-80"
545           encoding is available as "gb2312-raw".
546
547           See Encode::CN for details.
548
549       Big5
550           Microsoft extension to "Big5".
551
552           Proper name: "CP950".
553
554           Encode separately supports "Big5" and "cp950".
555
556       Shift_JIS
557           Microsoft's understanding of "Shift_JIS".
558
559           JIS has not endorsed the full Microsoft standard however.  The
560           official "Shift_JIS" includes only JIS X 0201 and JIS X 0208 char‐
561           acter sets, while Microsoft has always used "Shift_JIS" to encode a
562           wider character repertoire. See "IANA" registration for "Win‐
563           dows-31J".
564
565           As a historical predecessor, Microsoft's variant probably has more
566           rights for the name, though it may be objected that Microsoft
567           shouldn't have used JIS as part of the name in the first place.
568
569           Unambiguous name: "CP932". "IANA" name (also used by Mozilla, and
570           provided as an alias by Encode): "Windows-31J".
571
572           Encode separately supports "Shift_JIS" and "cp932".
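
           The alias decisions described in this section can be checked
           with a sketch along these lines.  CIRCLED DIGIT ONE (U+2460) is
           used here as one example of a character that Microsoft's
           variant encodes but the official "Shift_JIS" repertoire does
           not; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding :fallbacks);

# Where the commonly misused names end up in Encode.
my %canonical = map { ($_ => find_encoding($_)->name) }
    qw(KS_C_5601-1987 GB2312 Big5 Windows-31J);
printf "%-15s => %s\n", $_, $canonical{$_} for sort keys %canonical;

# cp932 can encode CIRCLED DIGIT ONE ...
my $circled_one = "\x{2460}";
my $cp932_bytes = encode('cp932', $circled_one);

# ... while strict shiftjis cannot, so FB_CROAK makes encode() die.
# encode() may modify its source when CHECK is set, hence the copy.
my $copy        = $circled_one;
my $in_shiftjis = eval { encode('shiftjis', $copy, FB_CROAK); 1 } ? 1 : 0;

printf "cp932: %s, representable in shiftjis: %s\n",
    join(' ', map { sprintf '%02X', $_ } unpack 'C*', $cp932_bytes),
    $in_shiftjis ? 'yes' : 'no';
```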
573

Glossary

575       character repertoire
576           A collection of unique characters.  A character set in the
577           strictest sense. At this stage, characters are not numbered.
578
579       coded character set (CCS)
580           A character set that is mapped in a way computers can use directly.
581           Many character encodings, including EUC, fall in this category.
582
583       character encoding scheme (CES)
           An algorithm to map a character set to a byte sequence.  You don't
           have to be able to tell which character set a given byte sequence
           belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is
           an example of being both a CCS and a CES.
588
589       charset (in MIME context)
590           has long been used in the meaning of "encoding", CES.
591
592           While the word combination "character set" has lost this meaning in
593           MIME context since [RFC 2130], the "charset" abbreviation has
594           retained it. This is how [RFC 2277] and [RFC 2278] bless "charset":
595
596            This document uses the term "charset" to mean a set of rules for
597            mapping from a sequence of octets to a sequence of characters, such
598            as the combination of a coded character set and a character encoding
599            scheme; this is also what is used as an identifier in MIME "charset="
600            parameters, and registered in the IANA charset registry ...  (Note
601            that this is NOT a term used by other standards bodies, such as ISO).
602            [RFC 2277]
603
       EUC Extended Unix Code.  See ISO-2022.
605
606       ISO-2022
607           A CES that was carefully designed to coexist with ASCII.  There are
608           a 7 bit version and an 8 bit version.
609
610           The 7 bit version switches character set via escape sequence so it
611           cannot form a CCS.  Since this is more difficult to handle in pro‐
612           grams than the 8 bit version, the 7 bit version is not very popular
613           except for iso-2022-jp, the de facto standard CES for e-mails.
614
615           The 8 bit version can form a CCS.  EUC and ISO-8859 are two exam‐
616           ples thereof.  Pre-5.6 perl could use them as string literals.
617
618       UCS Short for Universal Character Set.  When you say just UCS, it means
619           Unicode.
620
621       UCS-2
622           ISO/IEC 10646 encoding form: Universal Character Set coded in two
623           octets.
624
625       Unicode
626           A character set that aims to include all character repertoires of
627           the world.  Many character sets in various national as well as
628           industrial standards have become, in a way, just subsets of Uni‐
629           code.
630
631       UTF Short for Unicode Transformation Format.  Determines how to map a
632           Unicode character into a byte sequence.
633
634       UTF-16
635           A UTF in 16-bit encoding.  Can either be in big endian or little
636           endian.  The big endian version is called UTF-16BE (equal to UCS-2
637           + surrogate support) and the little endian version is called
638           UTF-16LE.
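
           The surrogate support mentioned above can be sketched as
           follows: a character beyond U+FFFF is split into a high and a
           low surrogate, and the result matches what encode() emits for
           UTF-16BE.  The sample character and variable names are
           arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{1F600}";    # a character outside the 16-bit range

# What Encode produces.
my $utf16be = encode('UTF-16BE', $char);

# The same computation by hand: subtract 0x10000, then split the
# remaining 20 bits into two 10-bit halves.
my $offset = ord($char) - 0x10000;
my $high   = 0xD800 + ($offset >> 10);
my $low    = 0xDC00 + ($offset & 0x3FF);
my $manual = pack 'n2', $high, $low;    # two big-endian 16-bit units

printf "surrogates: %04X %04X\n", $high, $low;
print "manual encoding matches Encode\n" if $manual eq $utf16be;
```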
639

See Also

       Encode, Encode::Byte, Encode::CN, Encode::JP, Encode::KR, Encode::TW,
       Encode::EBCDIC, Encode::Symbol, Encode::MIME::Header, Encode::Guess
643

References

645       ECMA
646           European Computer Manufacturers Association <http://www.ecma.ch>
647
648           ECMA-035 (eq "ISO-2022")
649               <http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
650
651               The specification of ISO-2022 is available from the link above.
652
653       IANA
654           Internet Assigned Numbers Authority <http://www.iana.org/>
655
656           Assigned Charset Names by IANA
657               <http://www.iana.org/assignments/character-sets>
658
659               Most of the "canonical names" in Encode derive from this list
660               so you can directly apply the string you have extracted from
661               MIME header of mails and web pages.
662
663       ISO International Organization for Standardization <http://www.iso.ch/>
664
       RFC Request For Comments -- need I say more?
           <http://www.rfc-editor.org/>, <http://www.rfc.net/>,
           <http://www.faqs.org/rfcs/>
667
668       UC  Unicode Consortium <http://www.unicode.org/>
669
670           Unicode Glossary
671               <http://www.unicode.org/glossary/>
672
673               The glossary of this document is based upon this site.
674
675       Other Notable Sites
676
677       czyborra.com
678           <http://czyborra.com/>
679
680           Contains a lot of useful information, especially gory details of
681           ISO vs. vendor mappings.
682
683       CJK.inf
684           <http://www.oreilly.com/people/authors/lunde/cjk_inf.html>
685
686           Somewhat obsolete (last update in 1996), but still useful.  Also
687           try
688
           <ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
691
692           You will find brief info on "EUC-CN", "GBK" and mostly on "GB
693           18030".
694
695       Jungshik Shin's Hangul FAQ
696           <http://jshin.net/faq>
697
698           And especially its subject 8.
699
700           <http://jshin.net/faq/qa8.html>
701
702           A comprehensive overview of the Korean ("KS *") standards.
703
704       debian.org: "Introduction to i18n"
           A brief description for most of the mentioned CJK encodings is
           contained in
           <http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
708
709       Offline sources
710
711       "CJKV Information Processing" by Ken Lunde
           CJKV Information Processing, 1999, O'Reilly & Associates,
           ISBN 1-56592-224-7
714
715           The modern successor of "CJK.inf".
716
717           Features a comprehensive coverage of CJKV character sets and encod‐
718           ings along with many other issues faced by anyone trying to better
719           support CJKV languages/scripts in all the areas of information pro‐
720           cessing.
721
           To purchase this book, visit
           <http://www.oreilly.com/catalog/cjkvinfo/> or your favourite
           bookstore.
724
725
726
perl v5.8.8                       2001-09-21            Encode::Supported(3pm)