UCONV(1)                        ICU 4.4.1 Manual                        UCONV(1)

NAME
       uconv - convert data from one encoding to another

SYNOPSIS
       uconv [ -h, -?, --help ] [ -V, --version ] [ -s, --silent ] [ -v,
       --verbose ] [ -l, --list | -l, --list-code code | --default-code | -L,
       --list-transliterators ] [ --canon ] [ -x transliteration ] [
       --to-callback callback | -c ] [ --from-callback callback | -i ] [
       --callback callback ] [ --fallback | --no-fallback ] [ -b, --block-size
       size ] [ -f, --from-code encoding ] [ -t, --to-code encoding ] [
       --add-signature ] [ --remove-signature ] [ -o, --output file ] [
       file... ]

DESCRIPTION
       uconv converts, or transcodes, each given file (or its standard input
       if no file is specified) from one encoding to another. The transcoding
       is done using Unicode as a pivot encoding (i.e. the data are first
       transcoded from their original encoding to Unicode, and then from
       Unicode to the destination encoding).
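
       The pivot approach can be illustrated outside of uconv. The following
       is a minimal Python sketch, not uconv's implementation: the function
       name transcode is hypothetical, and Python's str type stands in for
       the Unicode pivot.

```python
def transcode(data: bytes, from_code: str, to_code: str) -> bytes:
    """Transcode bytes by pivoting through Unicode (a Python str)."""
    text = data.decode(from_code)   # step 1: original encoding -> Unicode
    return text.encode(to_code)     # step 2: Unicode -> destination encoding


# U+00E9 is one byte in Latin-1 and two bytes in UTF-8.
latin1_bytes = b"caf\xe9"
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
print(utf8_bytes)
```

       Because every conversion goes through the same intermediate form, any
       supported source encoding can be paired with any supported destination
       encoding.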

       If an encoding is not specified or is -, the default encoding is used.
       Thus, calling uconv with no encoding provides an easy way to validate
       and sanitize data files for further consumption by tools requiring data
       in the default encoding.

       When calling uconv, it is possible to specify callbacks that are used
       to handle invalid characters in the input, or characters that cannot be
       transcoded to the destination encoding. Some encodings, for example,
       offer a default substitution character that can be used to represent
       the occurrence of such characters in the input. Other callbacks offer a
       useful visual representation of the invalid data.
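
       The two directions in which a callback can apply have a rough analogue
       in the error handlers of Python's codecs. The sketch below is only an
       analogy, not uconv's behavior: the pairing of Python handler names with
       uconv callback names is an assumption made for illustration, and the
       helper name stops is hypothetical.

```python
# 0xFF can never appear in well-formed UTF-8, so this input is invalid.
bad_input = b"ab\xffcd"

# Decoding direction (compare --from-callback): invalid input bytes.
skipped = bad_input.decode("utf-8", errors="ignore")       # roughly "skip"
substituted = bad_input.decode("utf-8", errors="replace")  # roughly "substitute"

# Encoding direction (compare --to-callback): U+00E9 has no ASCII mapping.
text = "caf\u00e9"
xml_dec = text.encode("ascii", errors="xmlcharrefreplace")


def stops(data: bytes) -> bool:
    """Return True if strict decoding fails (roughly the "stop" behavior)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return True
    return False


print(skipped, repr(substituted), xml_dec, stops(bad_input))
```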

       uconv can also run the specified transliteration on the transcoded
       data, in which case transliteration will happen as an intermediate
       step, after the data have been transcoded to Unicode. The
       transliteration can be either a list of semicolon-separated
       transliterator names, or an arbitrarily complex set of rules in the ICU
       transliteration rules format.

       For transcoding purposes, uconv options are compatible with those of
       iconv(1), making it easy to replace iconv(1) in scripts. It is not
       necessarily the case, however, that the encoding names used by uconv
       and ICU are the same as the ones used by iconv(1). Also, options that
       provide informational data, such as the -l, --list option offered by
       some iconv(1) variants (GNU's, for example), produce data in a slightly
       different and easier to parse format.

OPTIONS
       -h, -?, --help
              Print help about usage and exit.

       -V, --version
              Print the version of uconv and exit.

       -s, --silent
              Suppress messages during execution.

       -v, --verbose
              Display extra informative messages during execution.

       -l, --list
              List all the available encodings and exit.

       -l, --list-code code
              List only the code encoding and exit. If code is not a proper
              encoding, exit with an error.

       --default-code
              List only the name of the default encoding and exit.

       -L, --list-transliterators
              List all the available transliterators and exit.

       --canon
              If used with -l, --list or --default-code, the list of encodings
              is produced in a format compatible with convrtrs.txt(5). If
              used with -L, --list-transliterators, print only one
              transliterator name per line.

       -x transliteration
              Run the given transliteration on the transcoded Unicode data,
              and use the transliterated data as input for the transcoding to
              the destination encoding.

       --to-callback callback
              Use callback to handle characters that cannot be transcoded to
              the destination encoding. See section CALLBACKS for details on
              valid callbacks.

       -c     Omit invalid characters from the output. Same as --to-callback
              skip.

       --from-callback callback
              Use callback to handle characters that cannot be transcoded from
              the original encoding. See section CALLBACKS for details on
              valid callbacks.

       -i     Ignore invalid sequences in the input. Same as --from-callback
              skip.

       --callback callback
              Use callback to handle both characters that cannot be transcoded
              from the original encoding and characters that cannot be
              transcoded to the destination encoding. See section CALLBACKS
              for details on valid callbacks.

       --fallback
              Use the fallback mapping when transcoding from Unicode to the
              destination encoding.

       --no-fallback
              Do not use the fallback mapping when transcoding from Unicode to
              the destination encoding. This is the default.

       -b, --block-size size
              Read input in blocks of size bytes at a time. The default block
              size is 4096.

       -f, --from-code encoding
              Set the original encoding of the data to encoding.

       -t, --to-code encoding
              Transcode the data to encoding.

       --add-signature
              Add a U+FEFF Unicode signature character (BOM) if the output
              charset supports it and does not add one anyway.

       --remove-signature
              Remove a U+FEFF Unicode signature character (BOM).

       -o, --output file
              Write the transcoded data to file.
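
       Reading in fixed-size blocks, as -b, --block-size describes, means a
       multi-byte sequence can be split across two reads, and a leading
       U+FEFF signature has to be recognized only at the start of the stream.
       The Python sketch below shows one way to handle both. It is an
       illustration under assumptions, not uconv's code: the function name
       convert_stream and its parameters are hypothetical, and only the
       signature-removal side is shown.

```python
import codecs
import io


def convert_stream(src, dst, from_code, to_code, block_size=4096,
                   remove_signature=False):
    """Copy src to dst, transcoding block by block through Unicode."""
    # Incremental codecs buffer a partial multi-byte sequence between blocks.
    decoder = codecs.getincrementaldecoder(from_code)()
    encoder = codecs.getincrementalencoder(to_code)()
    at_start = True
    while True:
        block = src.read(block_size)
        last = not block  # an empty read signals end of input
        text = decoder.decode(block, final=last)
        if at_start and text:
            # Only the very first decoded character can be a signature.
            if remove_signature and text.startswith("\ufeff"):
                text = text[1:]
            at_start = False
        dst.write(encoder.encode(text, final=last))
        if last:
            break


# A 1-byte block size forces every multi-byte character to be split.
source = io.BytesIO("\ufeff\u00e9\u20ac".encode("utf-8"))
target = io.BytesIO()
convert_stream(source, target, "utf-8", "utf-16-le", block_size=1,
               remove_signature=True)
print(target.getvalue())
```

       The result does not depend on the block size; the size only changes
       how much data is held in memory at once.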

CALLBACKS
       uconv supports specifying callbacks to handle invalid data. Callbacks
       can be set for both directions of transcoding: from the original
       encoding to Unicode, with the --from-callback option, and from Unicode
       to the destination encoding, with the --to-callback option.

       The following is a list of valid callback names, along with a
       description of their behavior. The list of callbacks actually supported
       by uconv is displayed when it is called with -h, --help.

       substitute       Write the encoding's substitute sequence, or the
                        Unicode replacement character U+FFFD when transcoding
                        to Unicode.

       skip             Ignore the invalid data.

       stop             Stop with an error when encountering invalid data.
                        This is the default callback.

       escape           Same as escape-icu.

       escape-icu       Replace the missing characters with a string of the
                        format %Uhhhh for plane 0 characters, and %Uhhhh%Uhhhh
                        for planes 1 and above characters, where hhhh is the
                        hexadecimal value of one of the UTF-16 code units
                        representing the character. Characters from planes 1
                        and above are written as a pair of UTF-16 surrogate
                        code units.

       escape-java      Replace the missing characters with a string of the
                        format \uhhhh for plane 0 characters, and \uhhhh\uhhhh
                        for planes 1 and above characters, where hhhh is the
                        hexadecimal value of one of the UTF-16 code units
                        representing the character. Characters from planes 1
                        and above are written as a pair of UTF-16 surrogate
                        code units.

       escape-c         Replace the missing characters with a string of the
                        format \uhhhh for plane 0 characters, and \Uhhhhhhhh
                        for planes 1 and above characters, where hhhh and
                        hhhhhhhh are the hexadecimal values of the Unicode
                        codepoint.

       escape-xml       Same as escape-xml-hex.

       escape-xml-hex   Replace the missing characters with a string of the
                        format &#xhhhh;, where hhhh is the hexadecimal value
                        of the Unicode codepoint.

       escape-xml-dec   Replace the missing characters with a string of the
                        format &#nnnn;, where nnnn is the decimal value of
                        the Unicode codepoint.

       escape-unicode   Replace the missing characters with a string of the
                        format {U+hhhh}, where hhhh is the hexadecimal value
                        of the Unicode codepoint. That hexadecimal string is
                        of variable length and can use from 4 to 6 digits.
                        This is the format universally used to denote a
                        Unicode codepoint in the literature, delimited by
                        curly braces for easy recognition of those
                        substitutions in the output.
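
       The escape formats listed above differ mainly in whether they are
       built from UTF-16 code units or from the code point itself. The Python
       sketch below computes each format for a single code point. The
       function names utf16_units and escape are hypothetical, and the
       uppercase hexadecimal digits and the absence of zero padding in the
       XML forms are assumptions of this sketch; compare against actual uconv
       output before relying on the exact spelling.

```python
def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units that represent code point cp."""
    if cp < 0x10000:
        return [cp]                      # plane 0: a single code unit
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]  # surrogate pair


def escape(cp: int, style: str) -> str:
    """Format code point cp in the named escape style."""
    if style == "escape-icu":
        return "".join(f"%U{u:04X}" for u in utf16_units(cp))
    if style == "escape-java":
        return "".join(f"\\u{u:04X}" for u in utf16_units(cp))
    if style == "escape-c":
        return f"\\u{cp:04X}" if cp < 0x10000 else f"\\U{cp:08X}"
    if style == "escape-xml-hex":
        return f"&#x{cp:X};"
    if style == "escape-xml-dec":
        return f"&#{cp};"
    if style == "escape-unicode":
        return f"{{U+{cp:04X}}}"
    raise ValueError(f"unknown style: {style}")


# U+30AB is in plane 0; U+1F600 is in plane 1 and needs a surrogate pair.
for style in ("escape-icu", "escape-java", "escape-c",
              "escape-xml-hex", "escape-xml-dec", "escape-unicode"):
    print(style, escape(0x30AB, style), escape(0x1F600, style))
```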

EXAMPLES
       Convert data from a given encoding to the platform encoding:

           $ uconv -f encoding

       Check if a file contains valid data for a given encoding:

           $ uconv -f encoding -c file >/dev/null

       Convert a UTF-8 file to a given encoding and ensure that the resulting
       text is good for any version of HTML:

           $ uconv -f utf-8 -t encoding \
               --callback escape-xml-dec file

       Display the names of the Unicode code points in a UTF-8 file:

           $ uconv -f utf-8 -x any-name file

       Print the name of a Unicode code point whose value is known (U+30AB in
       this example):

           $ echo '\u30ab' | uconv -x 'hex-any; any-name'; echo
           {KATAKANA LETTER KA}{LINE FEED}
           $

       (The names are delimited by curly braces. The name of the line
       terminator is also displayed.)
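
       For comparison, the same kind of name lookup can be sketched with
       Python's unicodedata module. This is not ICU's any-name transliterator:
       the function name names is hypothetical, and because unicodedata has
       no name for control characters, the sketch falls back to a U+hhhh
       label for them.

```python
import unicodedata


def names(text: str) -> str:
    """Return the brace-delimited Unicode name of each character in text."""
    return "".join(
        "{" + unicodedata.name(ch, f"U+{ord(ch):04X}") + "}" for ch in text
    )


# The line feed has no name in unicodedata, so the fallback label is used.
print(names("\u30ab\n"))  # {KATAKANA LETTER KA}{U+000A}
```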

       Normalize UTF-8 data using Unicode NFKC, remove all control characters,
       and map Katakana to Hiragana:

           $ uconv -f utf-8 -t utf-8 \
               -x '::nfkc; [:Cc:] >; ::katakana-hiragana;'
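
       The three steps of that rule set can be approximated in Python to show
       what each one does to the text. This is a simplified sketch, not the
       ICU transliterators: the function names are hypothetical, and the
       Katakana mapping here is only a fixed offset over U+30A1..U+30F6,
       which does not cover every case the ICU rules handle.

```python
import unicodedata


def katakana_to_hiragana(text: str) -> str:
    """Map Katakana U+30A1..U+30F6 to Hiragana by subtracting 0x60."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def clean(text: str) -> str:
    """Apply NFKC, drop control characters, then map Katakana to Hiragana."""
    text = unicodedata.normalize("NFKC", text)           # like ::nfkc;
    text = "".join(ch for ch in text
                   if unicodedata.category(ch) != "Cc")  # like [:Cc:] >;
    return katakana_to_hiragana(text)                    # like ::katakana-hiragana;


# Halfwidth KA (U+FF76) becomes fullwidth under NFKC; the tab is removed.
print(clean("\uff76\u30bf\t\u30ab\u30ca"))
```

       Note that removing every character in the Cc category also removes
       line feeds, so line structure is lost.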

CAVEATS AND BUGS
       uconv reports errors as occurring at the first invalid byte
       encountered. This may be confusing to users of GNU iconv(1), which
       reports errors as occurring at the first byte of an invalid sequence.
       For multi-byte character sets or encodings, this means that uconv error
       positions may be at a later offset in the input stream than would be
       the case with GNU iconv(1).
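
       The difference between the two conventions can be seen on a truncated
       multi-byte sequence. The Python sketch below uses Python's own UTF-8
       decoder, not uconv or iconv(1), purely to show that "start of the
       invalid sequence" and "first byte that makes it invalid" are different
       offsets. The function name first_error_offsets is hypothetical.

```python
def first_error_offsets(data: bytes, encoding: str = "utf-8"):
    """Return (start, end) of the first undecodable span, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start: first byte of the bad sequence.
        # exc.end: offset just past it, where decoding could not continue.
        return exc.start, exc.end
    return None


# 0xE3 0x81 opens a three-byte sequence, but 'z' is not a continuation byte.
sample = b"ab\xe3\x81zz"
print(first_error_offsets(sample))
```

       Here the sequence begins at offset 2, while the byte that reveals the
       problem sits at offset 4. A tool reporting the former and a tool
       reporting the latter describe the same error with different positions.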

       The reporting of error positions when a transliterator is used may be
       inaccurate or unavailable, in which case uconv will report the offset
       in the output stream at which the error occurred.

AUTHORS
       Jonas Utterstroem
       Yves Arrouye

VERSION
       4.4.1

COPYRIGHT
       Copyright (C) 2000-2005 IBM, Inc. and others.

SEE ALSO
       iconv(1)

ICU MANPAGE                        2005-jul-1                         UCONV(1)