UNICODE(7) Linux Programmer's Manual UNICODE(7)

NAME
unicode - universal character set

DESCRIPTION
The international standard ISO 10646 defines the Universal Character
Set (UCS). UCS contains all characters of all other character set
standards. It also guarantees "round-trip compatibility"; in other
words, conversion tables can be built such that no information is lost
when a string is converted from any other encoding to UCS and back.

UCS contains the characters required to represent practically all known
languages. This includes not only the Latin, Greek, Cyrillic, Hebrew,
Arabic, Armenian, and Georgian scripts, but also Chinese, Japanese, and
Korean Han ideographs as well as scripts such as Hiragana, Katakana,
Hangul, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu,
Kannada, Malayalam, Thai, Lao, Khmer, Bopomofo, Tibetan, Runic,
Ethiopic, Canadian Syllabics, Cherokee, Mongolian, Ogham, Myanmar,
Sinhala, Thaana, Yi, and others. For scripts not yet covered, research
on how best to encode them for computer usage is ongoing, and they will
be added eventually. This might eventually include not only
Hieroglyphs and various historic Indo-European languages, but even some
selected artistic scripts such as Tengwar, Cirth, and Klingon. UCS
also covers a large number of graphical, typographical, mathematical,
and scientific symbols, including those provided by TeX, PostScript,
APL, MS-DOS, MS-Windows, Macintosh, and OCR fonts, as well as many word
processing and publishing systems; more are being added.

The UCS standard (ISO 10646) describes a 31-bit character set
architecture consisting of 128 24-bit groups, each divided into 256
16-bit planes made up of 256 8-bit rows with 256 column positions, one
for each character. Part 1 of the standard (ISO 10646-1) defines the
first 65534 code positions (0x0000 to 0xfffd), which form the Basic
Multilingual Plane (BMP), that is, plane 0 in group 0. Part 2 of the
standard (ISO 10646-2) adds characters to group 0 outside the BMP in
several supplementary planes in the range 0x10000 to 0x10ffff. There
are no plans to add characters beyond 0x10ffff to the standard;
therefore, of the entire code space, only a small fraction of group 0
will actually be used in the foreseeable future. The BMP contains all
characters found in the other commonly used character sets. The
supplementary planes added by ISO 10646-2 cover only more exotic
characters for special scientific, dictionary printing, publishing
industry, higher-level protocol, and enthusiast needs.

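The group/plane/row/column layout described above can be read directly
off the bits of a code value. The following is a minimal sketch, not
part of any library: the names struct ucs_position, ucs_decompose(),
and ucs_is_bmp() are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Position of a code value in the 31-bit UCS architecture. */
struct ucs_position {
    unsigned group;  /* 0..127, bits 24-30 */
    unsigned plane;  /* 0..255, bits 16-23 */
    unsigned row;    /* 0..255, bits 8-15 */
    unsigned cell;   /* 0..255, bits 0-7 (column position in the row) */
};

/* Split a UCS-4 code value into its group, plane, row, and cell. */
struct ucs_position ucs_decompose(uint32_t code)
{
    struct ucs_position pos;

    pos.group = (code >> 24) & 0x7f;
    pos.plane = (code >> 16) & 0xff;
    pos.row   = (code >> 8) & 0xff;
    pos.cell  = code & 0xff;
    return pos;
}

/* Nonzero if the code value lies in plane 0 of group 0 (the BMP). */
int ucs_is_bmp(uint32_t code)
{
    return code <= 0xffff;
}
```

For example, ucs_decompose(0x00c4) yields group 0, plane 0, row 0x00,
cell 0xc4, while any value in the range 0x10000 to 0x10ffff yields a
plane number between 1 and 16.
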
The representation of each UCS character as a 2-byte word is referred
to as the UCS-2 form (only for BMP characters), whereas UCS-4 is the
representation of each character by a 4-byte word. In addition, there
exist two encoding forms: UTF-8, for backward compatibility with ASCII
processing software, and UTF-16, for the backward-compatible handling
of non-BMP characters up to 0x10ffff by UCS-2 software.

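As an illustration of how UTF-16 extends UCS-2 to reach code values up
to 0x10ffff, the sketch below encodes one code value as one or two
16-bit units. The surrogate ranges it uses (0xd800 to 0xdbff and 0xdc00
to 0xdfff) are taken from the UTF-16 definition, not from this page,
and the function name utf16_encode() is hypothetical.

```c
#include <stdint.h>

/*
 * Encode one UCS code value as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2),
 * or 0 if the value cannot be represented.
 */
int utf16_encode(uint32_t code, uint16_t out[2])
{
    if (code >= 0xd800 && code <= 0xdfff)
        return 0;                 /* reserved for surrogates */
    if (code <= 0xffff) {
        out[0] = (uint16_t)code;  /* BMP: same value as in UCS-2 */
        return 1;
    }
    if (code > 0x10ffff)
        return 0;                 /* beyond the reach of UTF-16 */

    code -= 0x10000;              /* 20 significant bits remain */
    out[0] = (uint16_t)(0xd800 | (code >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xdc00 | (code & 0x3ff));  /* low surrogate */
    return 2;
}
```

BMP characters come out unchanged, which is why software written for
UCS-2 can pass UTF-16 data through without modification.
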
The UCS characters 0x0000 to 0x007f are identical to those of the
classic US-ASCII character set, and the characters in the range 0x0000
to 0x00ff are identical to those in ISO 8859-1 (Latin-1).

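Because every Latin-1 byte value is also its UCS code value, converting
Latin-1 text to UTF-8 needs no lookup table. The sketch below assumes
the standard UTF-8 bit layout for values below 0x800 (see utf-8(7));
the function name latin1_to_utf8() is hypothetical.

```c
#include <stddef.h>

/*
 * Convert len bytes of ISO 8859-1 text to UTF-8.
 * dst must have room for 2 * len + 1 bytes.
 * Returns the number of bytes written, not counting the final '\0'.
 */
size_t latin1_to_utf8(const unsigned char *src, size_t len,
                      unsigned char *dst)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];   /* byte value == UCS code value */

        if (c < 0x80) {
            dst[n++] = c;           /* US-ASCII range: unchanged */
        } else {
            dst[n++] = (unsigned char)(0xc0 | (c >> 6));   /* lead byte */
            dst[n++] = (unsigned char)(0x80 | (c & 0x3f)); /* continuation */
        }
    }
    dst[n] = '\0';
    return n;
}
```
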
Combining characters
Some code points in UCS have been assigned to combining characters.
These are similar to the nonspacing accent keys on a typewriter. A
combining character just adds an accent to the previous character. The
most important accented characters have codes of their own in UCS;
however, the combining character mechanism allows us to add accents and
other diacritical marks to any character. The combining characters
always follow the character which they modify. For example, the German
character Umlaut-A ("Latin capital letter A with diaeresis") can be
represented either by the precomposed UCS code 0x00c4 or as the
combination of a normal "Latin capital letter A" followed by a
"combining diaeresis": 0x0041 0x0308.

Combining characters are essential, for instance, for encoding the Thai
script, for mathematical typesetting, and for users of the
International Phonetic Alphabet.

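One practical consequence of the two spellings of Umlaut-A given above
is that they are different wide-character strings, so a plain
comparison does not treat them as equal. The sketch below only
demonstrates this; it performs no normalization, and the names
a_precomposed, a_decomposed, and same_code_values() are hypothetical.

```c
#include <wchar.h>

/* The two spellings of "Latin capital letter A with diaeresis". */
const wchar_t a_precomposed[] = { 0x00c4, 0 };
const wchar_t a_decomposed[]  = { 0x0041, 0x0308, 0 };

/*
 * Nonzero if two wide strings are identical code value by code value.
 * This is not a test for equivalence of the displayed text: the
 * strings are compared as stored, without any normalization.
 */
int same_code_values(const wchar_t *a, const wchar_t *b)
{
    return wcscmp(a, b) == 0;
}
```

Here wcslen(a_precomposed) is 1 and wcslen(a_decomposed) is 2, and
same_code_values() reports the two as different. Software that must
treat them alike would first bring both strings into a common
normalized form.
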
Implementation levels
As not all systems are expected to support advanced mechanisms like
combining characters, ISO 10646-1 specifies the following three
implementation levels of UCS:

Level 1  Combining characters and Hangul Jamo (a variant encoding of
         the Korean script, where a Hangul syllable glyph is coded as a
         triplet or pair of vowel/consonant codes) are not supported.

Level 2  In addition to level 1, combining characters are now allowed
         for some languages where they are essential (e.g., Thai, Lao,
         Hebrew, Arabic, Devanagari, Malayalam).

Level 3  All UCS characters are supported.

The Unicode 3.0 Standard published by the Unicode Consortium contains
exactly the UCS Basic Multilingual Plane at implementation level 3, as
described in ISO 10646-1:2000. Unicode 3.1 added the supplementary
planes of ISO 10646-2. The Unicode standard and technical reports
published by the Unicode Consortium provide much additional information
on the semantics and recommended usages of various characters. They
provide guidelines and algorithms for editing, sorting, comparing,
normalizing, converting, and displaying Unicode strings.

Unicode under Linux
Under GNU/Linux, the C type wchar_t is a signed 32-bit integer type.
Its values are always interpreted by the C library as UCS code values
(in all locales), a convention that is signaled by the GNU C library to
applications by defining the constant __STDC_ISO_10646__ as specified
in the ISO C99 standard.

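A program that relies on this convention can check for it at compile
time. The following is a small sketch; the function names
wchar_is_ucs() and wchar_holds_all_planes() are hypothetical.

```c
#include <wchar.h>

/*
 * Returns 1 if the C library signals that wchar_t values are UCS code
 * values in all locales, 0 otherwise.
 */
int wchar_is_ucs(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if wchar_t can hold every code value up to 0x10ffff. */
int wchar_holds_all_planes(void)
{
    return WCHAR_MAX >= 0x10ffff;
}
```

Portable code might fall back to an explicit conversion through
iconv(3) when wchar_is_ucs() returns 0.
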
UCS/Unicode can be used just like ASCII in input/output streams,
terminal communication, plaintext files, filenames, and environment
variables in the ASCII-compatible UTF-8 multibyte encoding. To signal
the use of UTF-8 as the character encoding to all applications, a
suitable locale has to be selected via environment variables (e.g.,
"LANG=en_GB.UTF-8").

The nl_langinfo(CODESET) function returns the name of the selected
encoding. Library functions such as wctomb(3) and mbsrtowcs(3) can be
used to transform the internal wchar_t characters and strings into the
system character encoding and back, and wcwidth(3) tells how many
positions (0–2) the cursor is advanced by the output of a character.

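The two steps described above, selecting the locale and converting
wide characters to the system encoding, might be combined as in the
following sketch. The helper names select_user_codeset() and
encode_wchar() are hypothetical; the library calls are setlocale(3),
nl_langinfo(3), and wctomb(3).

```c
#include <langinfo.h>
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Select the locale named by the environment (LANG, LC_ALL, ...) and
 * return the name of its character encoding, or NULL if the locale
 * could not be set.
 */
const char *select_user_codeset(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}

/*
 * Convert one wide character to the multibyte encoding of the current
 * locale. buf must have room for MB_LEN_MAX bytes. Returns the number
 * of bytes written, or -1 if the character cannot be represented.
 */
int encode_wchar(wchar_t wc, char *buf)
{
    (void)wctomb(NULL, 0);   /* reset any shift state */
    return wctomb(buf, wc);
}
```

In a UTF-8 locale, select_user_codeset() would be expected to return
"UTF-8", and encode_wchar(0x00c4, buf) to store the two bytes 0xc3
0x84. In the "C" locale only the ASCII range is guaranteed to convert.
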
Private Use Areas (PUA)
In the Basic Multilingual Plane, the range 0xe000 to 0xf8ff will never
be assigned to any characters by the standard and is reserved for
private usage. For the Linux community, this private area has been
subdivided further into the range 0xe000 to 0xefff, which can be used
individually by any end-user, and the Linux zone in the range 0xf000 to
0xf8ff, where extensions are coordinated among all Linux users. The
registry of the characters assigned to the Linux zone is maintained by
LANANA, and the registry itself is Documentation/admin-guide/unicode.rst
in the Linux kernel sources (or Documentation/unicode.txt before Linux
4.10).

Two other planes are reserved for private usage: plane 15
(Supplementary Private Use Area-A, range 0xf0000 to 0xffffd) and plane
16 (Supplementary Private Use Area-B, range 0x100000 to 0x10fffd).

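The ranges listed above translate directly into a classification
routine. This is a sketch only; enum pua_kind and pua_classify() are
hypothetical names, and the boundaries are exactly those given in the
text.

```c
#include <stdint.h>

enum pua_kind {
    PUA_NONE,        /* not a private-use code value */
    PUA_END_USER,    /* 0xe000 to 0xefff: individual end-user use */
    PUA_LINUX_ZONE,  /* 0xf000 to 0xf8ff: coordinated Linux zone */
    PUA_PLANE_15,    /* 0xf0000 to 0xffffd: Private Use Area-A */
    PUA_PLANE_16     /* 0x100000 to 0x10fffd: Private Use Area-B */
};

/* Report which private use area, if any, a code value belongs to. */
enum pua_kind pua_classify(uint32_t code)
{
    if (code >= 0xe000 && code <= 0xefff)
        return PUA_END_USER;
    if (code >= 0xf000 && code <= 0xf8ff)
        return PUA_LINUX_ZONE;
    if (code >= 0xf0000 && code <= 0xffffd)
        return PUA_PLANE_15;
    if (code >= 0x100000 && code <= 0x10fffd)
        return PUA_PLANE_16;
    return PUA_NONE;
}
```
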
Literature
*  Information technology — Universal Multiple-Octet Coded Character
   Set (UCS) — Part 1: Architecture and Basic Multilingual Plane.
   International Standard ISO/IEC 10646-1, International Organization
   for Standardization, Geneva, 2000.

   This is the official specification of UCS. Available from
   ⟨http://www.iso.ch/⟩.

*  The Unicode Standard, Version 3.0. The Unicode Consortium,
   Addison-Wesley, Reading, MA, 2000, ISBN 0-201-61633-5.

*  S. Harbison, G. Steele. C: A Reference Manual. Fourth edition,
   Prentice Hall, Englewood Cliffs, 1995, ISBN 0-13-326224-3.

   A good reference book about the C programming language. The fourth
   edition covers the 1994 Amendment 1 to the ISO C90 standard, which
   adds a large number of new C library functions for handling wide and
   multibyte character encodings, but it does not yet cover ISO C99,
   which improved wide and multibyte character support even further.

*  Unicode Technical Reports.
   ⟨http://www.unicode.org/reports/⟩

*  Markus Kuhn: UTF-8 and Unicode FAQ for UNIX/Linux.
   ⟨http://www.cl.cam.ac.uk/~mgk25/unicode.html⟩

*  Bruno Haible: Unicode HOWTO.
   ⟨http://www.tldp.org/HOWTO/Unicode-HOWTO.html⟩

SEE ALSO
locale(1), setlocale(3), charsets(7), utf-8(7)

COLOPHON
This page is part of release 5.12 of the Linux man-pages project. A
description of the project, information about reporting bugs, and the
latest version of this page, can be found at
https://www.kernel.org/doc/man-pages/.

GNU 2021-03-22 UNICODE(7)