Encode::Supported(3pm)

Encode::Supported(3)  User Contributed Perl Documentation Encode::Supported(3)

NAME

6       Encode::Supported -- Encodings supported by Encode
7

DESCRIPTION

9   Encoding Names
10       Encoding names are case insensitive. White space in names is ignored.
11       In addition, an encoding may have aliases.  Each encoding has one
12       "canonical" name.  The "canonical" name is chosen from the names of the
13       encoding by picking the first in the following sequence (with a few
14       exceptions).
15
16       • The name used by the Perl community.  That includes 'utf8' and
17         'ascii'.  Unlike aliases, canonical names directly reach the method
18         so such frequently used words like 'utf8' don't need to do alias
19         lookups.
20
21       • The MIME name as defined in IETF RFCs.  This includes all "iso-"s.
22
23       • The name in the IANA registry.
24
25       • The name used by the organization that defined it.
26
       Where a de jure canonical name differs from the one the Encode module
       uses, it is always aliased to Encode's name, provided the encoding is
       implemented at all.  So you can safely tell whether a given encoding
       is implemented just by passing the canonical name.
31
32       Because of all the alias issues, and because in the general case
33       encodings have state, "Encode" uses an encoding object internally once
34       an operation is in progress.
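
       The name resolution described above can be observed directly.  The
       following is a minimal sketch, assuming a stock Encode installation;
       the name "x-no-such-thing" is a made-up example of an unimplemented
       encoding.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# An alias resolves to an encoding object that carries the canonical name.
my $canonical = find_encoding("latin1")->name;
my $from_mime = find_encoding("US-ASCII")->name;    # lookup ignores case

# Encode::resolve_alias returns the canonical name, or false if unknown.
my $resolved = Encode::resolve_alias("latin1");

# A name that is not implemented yields undef rather than an exception.
my $missing = find_encoding("x-no-such-thing");

print "$canonical $from_mime $resolved\n";
print "x-no-such-thing is not implemented\n" unless defined $missing;
```

       The object returned by find_encoding() is the encoding object that
       Encode uses internally; its encode() and decode() methods can be
       called directly.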
35

Supported Encodings

37       As of Perl 5.8.0, at least the following encodings are recognized.
       Note that unless otherwise specified, they are all case insensitive
       (via alias) and all occurrences of spaces are replaced with '-'.  In
       other words, "ISO 8859 1" and "iso-8859-1" are identical.
41
42       Encodings are categorized and implemented in several different modules
43       but you don't have to "use Encode::XX" to make them available for most
44       cases.  Encode.pm will automatically load those modules on demand.
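
       As a sketch of on-demand loading, the script below never says "use
       Encode::JP", yet "euc-jp" is usable.  The sample text is an arbitrary
       pair of kanji chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# No "use Encode::JP" here: Encode loads it when "euc-jp" is first requested.
my $text  = "\x{65E5}\x{672C}";    # two kanji, written as code points
my $bytes = encode("euc-jp", $text);
my $back  = decode("euc-jp", $bytes);

# Encode->encodings(":all") lists every encoding name Encode can load.
my @names      = Encode->encodings(":all");
my $has_euc_jp = grep { $_ eq "euc-jp" } @names;

printf "%d encodings known; round trip %s\n",
    scalar @names, ( $back eq $text ? "ok" : "failed" );
```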
45
46   Built-in Encodings
47       The following encodings are always available.
48
49         Canonical     Aliases                      Comments & References
50         ----------------------------------------------------------------
51         ascii         US-ascii ISO-646-US                         [ECMA]
52         ascii-ctrl                                      Special Encoding
53         iso-8859-1    latin1                                       [ISO]
54         null                                            Special Encoding
55         utf8          UTF-8                                    [RFC2279]
56         ----------------------------------------------------------------
57
       null and ascii-ctrl are special.  "null" fails for all characters, so
       when you set fallback mode to PERLQQ, HTMLCREF or XMLCREF, ALL
       CHARACTERS will fall back to character references.  Ditto for
       "ascii-ctrl" except for control characters.  For fallback modes, see
       Encode.
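
       A minimal sketch of the fallback modes, assuming the FB_* constants
       documented in Encode.  The sample string is illustrative; only the
       HTMLCREF result for "ascii" is spelled out in a comment.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "caf\x{E9}";    # U+00E9 has no ASCII mapping

# With "ascii", only the unmappable character takes the fallback path.
my $html = encode( "ascii", $text, Encode::FB_HTMLCREF );    # caf&#233;
my $xml  = encode( "ascii", $text, Encode::FB_XMLCREF );
my $qq   = encode( "ascii", $text, Encode::FB_PERLQQ );

# With "null", every character is expected to take the fallback path.
my $all_refs = encode( "null", $text, Encode::FB_HTMLCREF );

print "$html\n$xml\n$qq\n$all_refs\n";
```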
62
63   Encode::Unicode -- other Unicode encodings
64       Unicode coding schemes other than native utf8 are supported by
65       Encode::Unicode, which will be autoloaded on demand.
66
67         ----------------------------------------------------------------
68         UCS-2BE       UCS-2, iso-10646-1                      [IANA, UC]
69         UCS-2LE                                                     [UC]
70         UTF-16                                                      [UC]
71         UTF-16BE                                                    [UC]
72         UTF-16LE                                                    [UC]
73         UTF-32                                                      [UC]
74         UTF-32BE      UCS-4                                         [UC]
75         UTF-32LE                                                    [UC]
76         UTF-7                                                  [RFC2152]
77         ----------------------------------------------------------------
78
79       To find how (UCS-2|UTF-(16|32))(LE|BE)? differ from one another, see
80       Encode::Unicode.
81
82       UTF-7 is a special encoding which "re-encodes" UTF-16BE into a 7-bit
83       encoding.  It is implemented separately by Encode::Unicode::UTF7.
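
       To give a rough idea of how the UTF-16 variants differ, the sketch
       below encodes the single character "A" three ways and prints the
       resulting bytes in hex.  See Encode::Unicode for the authoritative
       rules on byte order marks.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $be  = encode( "UTF-16BE", "A" );    # fixed big endian, no BOM
my $le  = encode( "UTF-16LE", "A" );    # fixed little endian, no BOM
my $bom = encode( "UTF-16",   "A" );    # BOM first, then big endian

printf "%-8s %s\n", $_->[0], unpack( "H*", $_->[1] )
    for [ "UTF-16BE", $be ], [ "UTF-16LE", $le ], [ "UTF-16", $bom ];
```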
84
85   Encode::Byte -- Extended ASCII
86       Encode::Byte implements most single-byte encodings except for Symbols
87       and EBCDIC. The following encodings are based on single-byte encodings
88       implemented as extended ASCII.  Most of them map \x80-\xff (upper half)
89       to non-ASCII characters.
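
       As an illustration of how the upper half varies between these
       encodings while the ASCII half stays put, the following sketch
       decodes a few sample bytes.  The byte values are chosen for
       illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same byte 0xA4 means different characters in Latin-1 and Latin-9.
my $latin1 = decode( "iso-8859-1",  "\xA4" );    # CURRENCY SIGN
my $latin9 = decode( "iso-8859-15", "\xA4" );    # EURO SIGN

# cp1252 puts the Euro sign at 0x80 instead.
my $cp1252 = decode( "cp1252", "\x80" );

# The lower half is plain ASCII, here shown with koi8-r.
my $ascii_half = decode( "koi8-r", "Perl" );

printf "U+%04X U+%04X U+%04X %s\n",
    ord $latin1, ord $latin9, ord $cp1252, $ascii_half;
```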
90
91       ISO-8859 and corresponding vendor mappings
92         Since there are so many, they are presented in table format with
93         languages and corresponding encoding names by vendors.  Note that the
94         table is sorted in order of ISO-8859 and the corresponding vendor
95         mappings are slightly different from that of ISO.  See
96         <http://czyborra.com/charsets/iso8859.html> for details.
97
           Lang/Regions  ISO/Other Std.  DOS     Windows Macintosh  Others
           ----------------------------------------------------------------
           N. America    (ASCII)         cp437        AdobeStandardEncoding
                                         cp863 (DOSCanadaF)
           W. Europe     iso-8859-1      cp850   cp1252  MacRoman  nextstep
                                                                   hp-roman8
                                         cp860 (DOSPortuguese)
           Cntrl. Europe iso-8859-2      cp852   cp1250  MacCentralEurRoman
                                                         MacCroatian
                                                         MacRomanian
                                                         MacRumanian
           Latin3[1]     iso-8859-3
           Latin4[2]     iso-8859-4
           Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
             (See also next section)     cp866           MacUkrainian
           Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                         cp1006          MacFarsi
           Greek         iso-8859-7      cp737   cp1253  MacGreek
                                         cp869 (DOSGreek2)
           Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
           Turkish       iso-8859-9      cp857   cp1254  MacTurkish
           Nordics       iso-8859-10     cp865
                                         cp861           MacIcelandic
                                                         MacSami
           Thai          iso-8859-11[3]  cp874           MacThai
           (iso-8859-12 is nonexistent. Reserved for Indics?)
           Baltics       iso-8859-13     cp775   cp1257
           Celtics       iso-8859-14
           Latin9 [4]    iso-8859-15
           Latin10       iso-8859-16
           Vietnamese    viscii                  cp1258  MacVietnamese
           ----------------------------------------------------------------

           [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
           [2] Baltics.  Now on 8859-10, except for Latvian.
           [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
           [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
               letters that are missing from 8859-1 were added.
136
137         All cp* are also available as ibm-*, ms-*, and windows-* .  See also
138         <http://czyborra.com/charsets/codepages.html>.
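
         The vendor prefixes can be checked with a short sketch like the
         one below.  The two names are examples picked from the table
         above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Both vendor-style names should resolve to the canonical cp* name.
my $windows = find_encoding("windows-1252")->name;
my $ibm     = find_encoding("ibm-850")->name;

print "windows-1252 => $windows\n";
print "ibm-850      => $ibm\n";
```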
139
140         Macintosh encodings don't seem to be registered in such entities as
141         IANA.  "Canonical" names in Encode are based upon Apple's Tech Note
142         1150.  See <http://developer.apple.com/technotes/tn/tn1150.html> for
143         details.
144
145       KOI8 - De Facto Standard for the Cyrillic world
146         Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
147         popular in the Net.   Encode comes with the following KOI charsets.
148         For gory details, see <http://czyborra.com/charsets/cyrillic.html>
149
150           ----------------------------------------------------------------
151           koi8-f
152           koi8-r cp878                                           [RFC1489]
153           koi8-u                                                 [RFC2319]
154           ----------------------------------------------------------------
155
156   gsm0338 - Hentai Latin 1
157       GSM0338 is for GSM handsets. Though it shares alphanumerals with ASCII,
158       control character ranges and other parts are mapped very differently,
159       mainly to store Greek characters.  There are also escape sequences
160       (starting with 0x1B) to cover e.g. the Euro sign.
161
       This was once handled by Encode::Byte but because of all those unusual
       specifications, Encode 2.20 has relocated the support to
       Encode::GSM0338. See Encode::GSM0338 for details.
165
166       gsm0338 support before 2.19
167         Some special cases like a trailing 0x00 byte or a lone 0x1B byte are
168         not well-defined and decode() will return an empty string for them.
169         One possible workaround is
170
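            use Encode qw(decode);
            # Trailing 0x00 byte: double it before decoding.
            $gsm =~ s/\x00\z/\x00\x00/;
            $uni = decode("gsm0338", $gsm);
            # Lone trailing 0x1B byte: append U+00A0 after decoding.
            $uni .= "\xA0" if $gsm =~ /\x1B\z/;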
174
         Note that the Encode implementation of GSM0338 does not implement the
         reuse of Latin capital letters as Greek capital letters (for example,
         0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
         LETTER ZETA)).
179
180         The GSM0338 is also covered in Encode::Byte even though it is not an
181         "extended ASCII" encoding.
182
183   CJK: Chinese, Japanese, Korean (Multibyte)
184       Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
185       below.  Also note that these are implemented in distinct modules by
186       countries, due to the size concerns (simplified Chinese is mapped to
187       'CN', continental China, while traditional Chinese is mapped to 'TW',
188       Taiwan).  Please refer to their respective documentation pages.
189
190       Encode::CN -- Continental China
191           Standard      DOS/Win Macintosh                Comment/Reference
192           ----------------------------------------------------------------
193           euc-cn [1]            MacChineseSimp
194           (gbk)         cp936 [2]
195           gb12345-raw                      { GB12345 without CES }
196           gb2312-raw                       { GB2312  without CES }
197           hz
198           iso-ir-165
199           ----------------------------------------------------------------
200
           [1] GB2312 is aliased to this.  See "Microsoft-related naming mess".
           [2] gbk is aliased to this.  See "Microsoft-related naming mess".
203
204       Encode::JP -- Japan
205           Standard      DOS/Win Macintosh                Comment/Reference
206           ----------------------------------------------------------------
207           euc-jp
208           shiftjis      cp932   macJapanese
209           7bit-jis
210           iso-2022-jp                                            [RFC1468]
211           iso-2022-jp-1                                          [RFC2237]
212           jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
213           jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }
214           jis0212-raw  { JIS X 0212 (Extended Kanji)         without CES }
215           ----------------------------------------------------------------
216
217       Encode::KR -- Korea
218           Standard      DOS/Win Macintosh                Comment/Reference
219           ----------------------------------------------------------------
220           euc-kr                MacKorean                        [RFC1557]
221                         cp949 [1]
222           iso-2022-kr                                            [RFC1557]
223           johab                                  [KS X 1001:1998, Annex 3]
224           ksc5601-raw                              { KSC5601 without CES }
225           ----------------------------------------------------------------
226
227           [1] ks_c_5601-1987, (x-)?windows-949, and uhc are aliased to this.
228           See below.
229
230       Encode::TW -- Taiwan
231           Standard      DOS/Win Macintosh                Comment/Reference
232           ----------------------------------------------------------------
233           big5-eten     cp950   MacChineseTrad {big5 aliased to big5-eten}
234           big5-hkscs
235           ----------------------------------------------------------------
236
237       Encode::HanExtra -- More Chinese via CPAN
238         Due to the size concerns, additional Chinese encodings below are
239         distributed separately on CPAN, under the name Encode::HanExtra.
240
241           Standard      DOS/Win Macintosh                Comment/Reference
242           ----------------------------------------------------------------
243           big5ext                                   CMEX's Big5e Extension
244           big5plus                                  CMEX's Big5+ Extension
245           cccii         Chinese Character Code for Information Interchange
246           euc-tw                             EUC (Extended Unix Character)
247           gb18030                          GBK with Traditional Characters
248           ----------------------------------------------------------------
249
250       Encode::JIS2K -- JIS X 0213 encodings via CPAN
251         Due to size concerns, additional Japanese encodings below are
252         distributed separately on CPAN, under the name Encode::JIS2K.
253
254           Standard      DOS/Win Macintosh                Comment/Reference
255           ----------------------------------------------------------------
256           euc-jisx0213
           shiftjisx0213
258           iso-2022-jp-3
259           jis0213-1-raw
260           jis0213-2-raw
261           ----------------------------------------------------------------
262
263   Miscellaneous encodings
264       Encode::EBCDIC
265         See perlebcdic for details.
266
267           ----------------------------------------------------------------
268           cp37
269           cp500
270           cp875
271           cp1026
272           cp1047
273           posix-bc
274           ----------------------------------------------------------------
275
276       Encode::Symbols
         For symbols and dingbats.
278
279           ----------------------------------------------------------------
280           symbol
281           dingbats
282           MacDingbats
283           AdobeZdingbat
284           AdobeSymbol
285           ----------------------------------------------------------------
286
287       Encode::MIME::Header
         Strictly speaking, the MIME header encoding documented in RFC 2047
         is more encapsulation than encoding.  However, support for it is
         indispensable in the modern world, so it is supported.
291
292           ----------------------------------------------------------------
293           MIME-Header                                            [RFC2047]
294           MIME-B                                                 [RFC2047]
295           MIME-Q                                                 [RFC2047]
296           ----------------------------------------------------------------
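
         A minimal sketch of decoding an RFC 2047 encoded-word.  The header
         value is a made-up sample, and the exact output of encode() (line
         folding, choice of B or Q) is left unasserted because it may vary
         between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A sample encoded-word: charset UTF-8, base64 ("B") transfer encoding.
my $header  = "=?UTF-8?B?5pel5pys?=";
my $subject = decode( "MIME-Header", $header );

# Encoding and decoding again should give back the same characters.
my $again = decode( "MIME-Header", encode( "MIME-Header", $subject ) );

# Print code points rather than raw wide characters.
printf "U+%04X ", ord for split //, $subject;
print "\n";
```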
297
298       Encode::Guess
         This one is not the name of an encoding but a utility that lets you
         pick the most appropriate encoding for a piece of data from a given
         list of suspects.  See Encode::Guess for details.
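
         A minimal sketch of guessing, assuming the guess_encoding()
         interface documented in Encode::Guess.  The octets are a sample
         EUC-JP string, and "euc-jp" is the only suspect added to the
         defaults.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

my $octets = "\xC6\xFC\xCB\xDC\xB8\xEC";    # three kanji in EUC-JP

# On success an encoding object comes back; on failure, an error string.
my $guess = guess_encoding( $octets, qw(euc-jp) );
die "cannot guess: $guess" unless ref $guess;

my $name = $guess->name;
my $text = $guess->decode($octets);

printf "%s: %d characters\n", $name, length $text;
```

         With several plausible suspects the guess can be ambiguous, in
         which case the returned string names the candidates.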
302

Unsupported encodings

304       The following encodings are not supported as yet; some because they are
305       rarely used, some because of technical difficulties.  They may be
306       supported by external modules via CPAN in the future, however.
307
308       ISO-2022-JP-2 [RFC1554]
309         Not very popular yet.  Needs Unicode Database or equivalent to
310         implement encode() (because it includes JIS X 0208/0212, KSC5601, and
311         GB2312 simultaneously, whose code points in Unicode overlap.  So you
312         need to lookup the database to determine to what character set a
313         given Unicode character should belong).
314
315       ISO-2022-CN [RFC1922]
316         Not very popular.  Needs CNS 11643-1 and -2 which are not available
317         in this module.  CNS 11643 is supported (via euc-tw) in
318         Encode::HanExtra.  Audrey Tang may add support for this encoding in
319         her module in future.
320
321       Various HP-UX encodings
322         The following are unsupported due to the lack of mapping data.
323
324           '8'  - arabic8, greek8, hebrew8, kana8, thai8, and turkish8
325           '15' - japanese15, korean15, and roi15
326
327       Cyrillic encoding ISO-IR-111
328         Anton Tagunov doubts its usefulness.
329
330       ISO-8859-8-1 [Hebrew]
331         None of the Encode team knows Hebrew enough (ISO-8859-8, cp1255 and
332         MacHebrew are supported because and just because there were mappings
333         available at <http://www.unicode.org/>).  Contributions welcome.
334
335       ISIRI 3342, Iran System, ISIRI 2900 [Farsi]
336         Ditto.
337
338       Thai encoding TCVN
339         Ditto.
340
341       Vietnamese encodings VPS
342         Though Jungshik Shin has reported that Mozilla supports this
343         encoding, it was too late before 5.8.0 for us to add it.  In the
344         future, it may be available via a separate module.  See
345         <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.uf>
346         and
347         <http://lxr.mozilla.org/seamonkey/source/intl/uconv/ucvlatin/vps.ut>
348         if you are interested in helping us.
349
350       Various Mac encodings
351         The following are unsupported due to the lack of mapping data.
352
353           MacArmenian,  MacBengali,   MacBurmese,   MacEthiopic
354           MacExtArabic, MacGeorgian,  MacKannada,   MacKhmer
355           MacLaotian,   MacMalayalam, MacMongolian, MacOriya
356           MacSinhalese, MacTamil,     MacTelugu,    MacTibetan
357           MacVietnamese
358
359         The rest which are already available are based upon the vendor
360         mappings at <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/> .
361
362       (Mac) Indic encodings
363         The maps for the following are available at <http://www.unicode.org/>
364         but remain unsupported because those encodings need an algorithmical
365         approach, currently unsupported by enc2xs:
366
367           MacDevanagari
368           MacGurmukhi
369           MacGujarati
370
371         For details, please see "Unicode mapping issues and notes:" at
372         <http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT> .
373
374         I believe this issue is prevalent not only for Mac Indics but also in
375         other Indic encodings, but the above were the only Indic encodings
376         maps that I could find at <http://www.unicode.org/> .
377

Encoding vs. Charset -- terminology

379       We are used to using the term (character) encoding and character set
380       interchangeably.  But just as confusing the terms byte and character is
381       dangerous and the terms should be differentiated when needed, we need
382       to differentiate encoding and character set.
383
384       To understand that, here is a description of how we make computers grok
385       our characters.
386
387       • First we start with which characters to include.  We call this
388         collection of characters character repertoire.
389
390       • Then we have to give each character a unique ID so your computer can
391         tell the difference between 'a' and 'A'.  This itemized character
392         repertoire is now a character set.
393
       • If your computer can grok the character set without further
         processing, you can go ahead and use it.  This is called a coded
         character set (CCS) or raw character encoding.  ASCII is used this
         way for most cases.
398
399       • But in many cases, especially multi-byte CJK encodings, you have to
400         tweak a little more.  Your network connection may not accept any data
401         with the Most Significant Bit set, and your computer may not be able
402         to tell if a given byte is a whole character or just half of it.  So
403         you have to encode the character set to use it.
404
405         A character encoding scheme (CES) determines how to encode a given
406         character set, or a set of multiple character sets.  7bit ISO-2022 is
407         an example of a CES.  You switch between character sets via escape
408         sequences.
409
410       Technically, or mathematically, speaking, a character set encoded in
411       such a CES that maps character by character may form a CCS.  EUC is
412       such an example.  The CES of EUC is as follows:
413
414       • Map ASCII unchanged.
415
       • Map a character set that consists of 94^N or 96^N members (N bytes
         per character) by adding 0x80 to each byte.
418
419       • You can also use 0x8e and 0x8f to indicate that the following
420         sequence of characters belongs to yet another character set.  To each
421         following byte is added the value 0x80.
422
       By carefully looking at the encoded byte sequence, you can find that
       the byte sequence forms a unique number.  In that sense, EUC is a
       CCS generated by the CES above from up to four CCSs (complicated?).
       UTF-8 falls into this category.  See "UTF-8" in perlunicode to find
       out how UTF-8 maps Unicode to a byte sequence.
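
       The "add 0x80 to each byte" rule of EUC can be checked against
       Encode itself.  The sketch below compares the raw JIS X 0208 code of
       one sample kanji with its EUC-JP form; the character is an arbitrary
       choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{65E5}";    # one kanji from JIS X 0208

my $raw = encode( "jis0208-raw", $char );    # character set without CES
my $euc = encode( "euc-jp",      $char );    # same set under the EUC CES

# Apply the EUC rule by hand: set the high bit of every byte.
my $by_hand = join "", map { chr( ord($_) | 0x80 ) } split //, $raw;

printf "raw %s  euc %s  by hand %s\n",
    map { unpack "H*", $_ } $raw, $euc, $by_hand;
```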
428
429       You may also have found out by now why 7bit ISO-2022 cannot comprise a
430       CCS.  If you look at a byte sequence \x21\x21, you can't tell if it is
431       two !'s or IDEOGRAPHIC SPACE.  EUC maps the latter to \xA1\xA1 so you
       have no trouble differentiating between "!!" and IDEOGRAPHIC SPACE.
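
       The ambiguity can be made visible with a short sketch.  It encodes
       IDEOGRAPHIC SPACE (U+3000) and "!!" in both iso-2022-jp and euc-jp
       and prints the bytes in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $space = "\x{3000}";    # IDEOGRAPHIC SPACE

# In 7-bit ISO-2022 the payload is \x21\x21 either way; only the
# surrounding escape sequences tell the two apart.
my $jis_space = encode( "iso-2022-jp", $space );
my $jis_bangs = encode( "iso-2022-jp", "!!" );

# In EUC the two byte sequences differ by themselves.
my $euc_space = encode( "euc-jp", $space );
my $euc_bangs = encode( "euc-jp", "!!" );

printf "%-12s %s\n", $_->[0], unpack( "H*", $_->[1] )
    for [ "jis space", $jis_space ], [ "jis !!", $jis_bangs ],
        [ "euc space", $euc_space ], [ "euc !!", $euc_bangs ];
```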
433

Encoding Classification (by Anton Tagunov and Dan Kogai)

435       This section tries to classify the supported encodings by their
436       applicability for information exchange over the Internet and to choose
437       the most suitable aliases to name them in the context of such
438       communication.
439
440       • To (en|de)code encodings marked by "(**)", you need
441         "Encode::HanExtra", available from CPAN.
442
443       Encoding names
444
445         US-ASCII    UTF-8    ISO-8859-*  KOI8-R
446         Shift_JIS   EUC-JP   ISO-2022-JP ISO-2022-JP-1
447         EUC-KR      Big5     GB2312
448
449       are registered with IANA as preferred MIME names and may be used over
450       the Internet.
451
452       "Shift_JIS" has been officialized by JIS X 0208:1997.  "Microsoft-
453       related naming mess" gives details.
454
455       "GB2312" is the IANA name for "EUC-CN".  See "Microsoft-related naming
456       mess" for details.
457
458       "GB_2312-80" raw encoding is available as "gb2312-raw" with Encode. See
459       Encode::CN for details.
460
461         EUC-CN
462         KOI8-U        [RFC2319]
463
464       have not been registered with IANA (as of March 2002) but seem to be
465       supported by major web browsers.  The IANA name for "EUC-CN" is
466       "GB2312".
467
468         KS_C_5601-1987
469
470       is heavily misused.  See "Microsoft-related naming mess" for details.
471
       "KS_C_5601-1987" raw encoding is available as "ksc5601-raw" with
       Encode. See Encode::KR for details.
474
475         UTF-16 UTF-16BE UTF-16LE
476
477       are IANA-registered "charset"s. See [RFC 2781] for details.  Jungshik
478       Shin reports that UTF-16 with a BOM is well accepted by MS IE 5/6 and
479       NS 4/6. Beware however that
480
481       • "UTF-16" support in any software you're going to be
         using/interoperating with has probably been less tested than "UTF-8"
483         support
484
485       • "UTF-8" coded data seamlessly passes traditional command piping
486         ("cat", "more", etc.) while "UTF-16" coded data is likely to cause
487         confusion (with its zero bytes, for example)
488
489       • it is beyond the power of words to describe the way HTML browsers
490         encode non-"ASCII" form data. To get a general impression, visit
491         <http://www.alanflavell.org.uk/charset/form-i18n.html>.  While
492         encoding of form data has stabilized for "UTF-8" encoded pages (at
493         least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
494         expect fun (and cross-browser discrepancies) with "UTF-16" encoded
495         pages!
496
497       The rule of thumb is to use "UTF-8" unless you know what you're doing
498       and unless you really benefit from using "UTF-16".
499
500         ISO-IR-165    [RFC1345]
501         VISCII
502         GB 12345
503         GB 18030 (**)  (see links below)
504         EUC-TW   (**)
505
506       are totally valid encodings but not registered at IANA.  The names
507       under which they are listed here are probably the most widely-known
508       names for these encodings and are recommended names.
509
510         BIG5PLUS (**)
511
512       is a proprietary name.
513
514   Microsoft-related naming mess
515       Microsoft products misuse the following names:
516
517       KS_C_5601-1987
518         Microsoft extension to "EUC-KR".
519
520         Proper names: "CP949", "UHC", "x-windows-949" (as used by Mozilla).
521
522         See
523         <http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>
524         for details.
525
526         Encode aliases "KS_C_5601-1987" to "cp949" to reflect this common
527         misusage. Raw "KS_C_5601-1987" encoding is available as
         "ksc5601-raw".
529
530         See Encode::KR for details.
531
532       GB2312
533         Microsoft extension to "EUC-CN".
534
535         Proper names: "CP936", "GBK".
536
537         "GB2312" has been registered in the "EUC-CN" meaning at IANA. This
538         has partially repaired the situation: Microsoft's "GB2312" has become
539         a superset of the official "GB2312".
540
541         Encode aliases "GB2312" to "euc-cn" in full agreement with IANA
542         registration. "cp936" is supported separately.  Raw "GB_2312-80"
543         encoding is available as "gb2312-raw".
544
545         See Encode::CN for details.
546
547       Big5
548         Microsoft extension to "Big5".
549
550         Proper name: "CP950".
551
552         Encode separately supports "Big5" and "cp950".
553
554       Shift_JIS
555         Microsoft's understanding of "Shift_JIS".
556
557         JIS has not endorsed the full Microsoft standard however.  The
558         official "Shift_JIS" includes only JIS X 0201 and JIS X 0208
559         character sets, while Microsoft has always used "Shift_JIS" to encode
560         a wider character repertoire. See "IANA" registration for
561         "Windows-31J".
562
563         As a historical predecessor, Microsoft's variant probably has more
564         rights for the name, though it may be objected that Microsoft
565         shouldn't have used JIS as part of the name in the first place.
566
567         Unambiguous name: "CP932". "IANA" name (also used by Mozilla, and
568         provided as an alias by Encode): "Windows-31J".
569
570         Encode separately supports "Shift_JIS" and "cp932".
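
       The aliasing described in this section can be inspected with a short
       sketch.  The list of names below is taken from the entries above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my @names = ( "GB2312", "GBK", "KS_C_5601-1987", "Big5", "Shift_JIS",
    "Windows-31J" );

# Map each commonly seen name to the canonical name Encode resolves it to.
my %canonical = map { ( $_ => find_encoding($_)->name ) } @names;

printf "%-15s => %s\n", $_, $canonical{$_} for @names;
```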
571

Glossary

573       character repertoire
574         A collection of unique characters.  A character set in the strictest
575         sense. At this stage, characters are not numbered.
576
577       coded character set (CCS)
578         A character set that is mapped in a way computers can use directly.
579         Many character encodings, including EUC, fall in this category.
580
581       character encoding scheme (CES)
         An algorithm to map a character set to a byte sequence.  You don't
         have to be able to tell which character set a given byte sequence
         belongs to.  7-bit ISO-2022 is a CES but it cannot be a CCS.  EUC is
         an example of being both a CCS and CES.
586
587       charset (in MIME context)
588         has long been used in the meaning of "encoding", CES.
589
590         While the word combination "character set" has lost this meaning in
591         MIME context since [RFC 2130], the "charset" abbreviation has
592         retained it. This is how [RFC 2277] and [RFC 2278] bless "charset":
593
594          This document uses the term "charset" to mean a set of rules for
595          mapping from a sequence of octets to a sequence of characters, such
596          as the combination of a coded character set and a character encoding
597          scheme; this is also what is used as an identifier in MIME "charset="
598          parameters, and registered in the IANA charset registry ...  (Note
599          that this is NOT a term used by other standards bodies, such as ISO).
600          [RFC 2277]
601
602       EUC
603         Extended Unix Character.  See ISO-2022.
604
605       ISO-2022
606         A CES that was carefully designed to coexist with ASCII.  There are a
607         7 bit version and an 8 bit version.
608
609         The 7 bit version switches character set via escape sequence so it
610         cannot form a CCS.  Since this is more difficult to handle in
611         programs than the 8 bit version, the 7 bit version is not very
612         popular except for iso-2022-jp, the de facto standard CES for
613         e-mails.
614
615         The 8 bit version can form a CCS.  EUC and ISO-8859 are two examples
616         thereof.  Pre-5.6 perl could use them as string literals.
617
618       UCS
619         Short for Universal Character Set.  When you say just UCS, it means
620         Unicode.
621
622       UCS-2
623         ISO/IEC 10646 encoding form: Universal Character Set coded in two
624         octets.
625
626       Unicode
627         A character set that aims to include all character repertoires of the
628         world.  Many character sets in various national as well as industrial
629         standards have become, in a way, just subsets of Unicode.
630
631       UTF
632         Short for Unicode Transformation Format.  Determines how to map a
633         Unicode character into a byte sequence.
634
635       UTF-16
636         A UTF in 16-bit encoding.  Can either be in big endian or little
637         endian.  The big endian version is called UTF-16BE (equal to UCS-2 +
638         surrogate support) and the little endian version is called UTF-16LE.
639

References

645       ECMA
646         European Computer Manufacturers Association <http://www.ecma.ch>
647
648         ECMA-035 (eq "ISO-2022")
649           <http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM>
650
651           The specification of ISO-2022 is available from the link above.
652
653       IANA
654         Internet Assigned Numbers Authority <http://www.iana.org/>
655
656         Assigned Charset Names by IANA
657           <http://www.iana.org/assignments/character-sets>
658
659           Most of the "canonical names" in Encode derive from this list so
660           you can directly apply the string you have extracted from MIME
661           header of mails and web pages.
662
663       ISO
664         International Organization for Standardization <http://www.iso.ch/>
665
666       RFC
667         Request For Comments -- need I say more?
668         <http://www.rfc-editor.org/>, <http://www.ietf.org/rfc.html>,
669         <http://www.faqs.org/rfcs/>
670
671       UC
672         Unicode Consortium <http://www.unicode.org/>
673
674         Unicode Glossary
675           <http://www.unicode.org/glossary/>
676
677           The glossary of this document is based upon this site.
678
679   Other Notable Sites
680       czyborra.com
681         <http://czyborra.com/>
682
683         Contains a lot of useful information, especially gory details of ISO
684         vs. vendor mappings.
685
686       CJK.inf
687         <http://examples.oreilly.com/cjkvinfo/doc/cjk.inf>
688
689         Somewhat obsolete (last update in 1996), but still useful.  Also try
690
691         <ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>
692
693         You will find brief info on "EUC-CN", "GBK" and mostly on "GB 18030".
694
695       Jungshik Shin's Hangul FAQ
696         <http://jshin.net/faq>
697
698         And especially its subject 8.
699
700         <http://jshin.net/faq/qa8.html>
701
702         A comprehensive overview of the Korean ("KS *") standards.
703
704       debian.org: "Introduction to i18n"
705         A brief description for most of the mentioned CJK encodings is
706         contained in
707         <http://www.debian.org/doc/manuals/intro-i18n/ch-codes.en.html>
708
709   Offline sources
710       "CJKV Information Processing" by Ken Lunde
711         CJKV Information Processing 1999 O'Reilly & Associates, ISBN :
712         1-56592-224-7
713
714         The modern successor of "CJK.inf".
715
716         Features a comprehensive coverage of CJKV character sets and
717         encodings along with many other issues faced by anyone trying to
718         better support CJKV languages/scripts in all the areas of information
719         processing.
720
721         To purchase this book, visit
722         <http://oreilly.com/catalog/9780596514471/> or your favourite
723         bookstore.
724
perl v5.34.0                      2021-10-10              Encode::Supported(3)