UCONV(1)                        ICU 60.2 Manual                       UCONV(1)

NAME

       uconv - convert data from one encoding to another

SYNOPSIS

       uconv [ -h, -?, --help ] [ -V, --version ] [ -s, --silent ] [ -v,
       --verbose ] [ -l, --list | -l, --list-code code | --default-code | -L,
       --list-transliterators ] [ --canon ] [ -x transliteration ] [
       --to-callback callback | -c ] [ --from-callback callback | -i ] [
       --callback callback ] [ --fallback | --no-fallback ] [ -b, --block-size
       size ] [ -f, --from-code encoding ] [ -t, --to-code encoding ] [
       --add-signature ] [ --remove-signature ] [ -o, --output file ] [
       file... ]

DESCRIPTION

       uconv converts, or transcodes, each given file (or standard input if
       no file is specified) from one encoding to another. The transcoding
       uses Unicode as a pivot encoding: the data are first transcoded from
       their original encoding to Unicode, and then from Unicode to the
       destination encoding.

       If an encoding is not specified or is -, the default encoding is used.
       Thus, calling uconv with no encoding provides an easy way to validate
       and sanitize data files for further consumption by tools requiring data
       in the default encoding.

       When calling uconv, it is possible to specify callbacks that are used
       to handle invalid characters in the input, or characters that cannot be
       transcoded to the destination encoding. Some encodings, for example,
       offer a default substitution character that can be used to represent
       the occurrence of such characters in the input. Other callbacks offer a
       useful visual representation of the invalid data.

       uconv can also run a specified transliteration on the transcoded data,
       in which case transliteration happens as an intermediate step, after
       the data have been transcoded to Unicode. The transliteration can be
       either a list of semicolon-separated transliterator names, or an
       arbitrarily complex set of rules in the ICU transliteration rules
       format.

       For transcoding purposes, uconv options are compatible with those of
       iconv(1), making it easy to use uconv as a replacement for iconv(1) in
       scripts. However, the encoding names used by uconv and ICU are not
       necessarily the same as those used by iconv(1). Also, options that
       provide informational data, such as the -l, --list option offered by
       some iconv(1) variants such as GNU's, produce data in a slightly
       different and easier to parse format.
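       As an illustration of the iconv(1)-style usage described above, the
       following sketch transcodes a small ISO-8859-1 file to UTF-8. The file
       names and the sample text are hypothetical, and the encoding names are
       assumed to be available in the local ICU installation (uconv -l lists
       the ones actually supported).

```shell
# Build a small ISO-8859-1 sample: "café" followed by a newline.
# \351 is the octal escape for byte 0xE9 (é in ISO-8859-1).
printf 'caf\351\n' > sample-latin1.txt    # hypothetical file name

# Transcode to UTF-8, iconv-style, writing the result with -o.
uconv -f iso-8859-1 -t utf-8 -o sample-utf8.txt sample-latin1.txt

# Inspect the bytes: é should now be the two-byte UTF-8 sequence c3 a9.
od -An -tx1 sample-utf8.txt
```

       Dumping the bytes with od(1) rather than printing the text avoids
       depending on the terminal's own encoding.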

OPTIONS

       -h, -?, --help
              Print help about usage and exit.

       -V, --version
              Print the version of uconv and exit.

       -s, --silent
              Suppress messages during execution.

       -v, --verbose
              Display extra informative messages during execution.

       -l, --list
              List all the available encodings and exit.

       -l, --list-code code
              List only the encoding code and exit. If code is not a valid
              encoding, exit with an error.

       --default-code
              List only the name of the default encoding and exit.

       -L, --list-transliterators
              List all the available transliterators and exit.

       --canon
              If used with -l, --list or --default-code, the list of encodings
              is produced in a format compatible with convrtrs.txt(5). If
              used with -L, --list-transliterators, print only one
              transliterator name per line.

       -x transliteration
              Run the given transliteration on the transcoded Unicode data,
              and use the transliterated data as input for the transcoding to
              the destination encoding.

       --to-callback callback
              Use callback to handle characters that cannot be transcoded to
              the destination encoding. See section CALLBACKS for details on
              valid callbacks.

       -c     Omit invalid characters from the output. Same as --to-callback
              skip.

       --from-callback callback
              Use callback to handle characters that cannot be transcoded from
              the original encoding. See section CALLBACKS for details on
              valid callbacks.

       -i     Ignore invalid sequences in the input. Same as --from-callback
              skip.

       --callback callback
              Use callback to handle both characters that cannot be transcoded
              from the original encoding and characters that cannot be
              transcoded to the destination encoding. See section CALLBACKS
              for details on valid callbacks.

       --fallback
              Use the fallback mapping when transcoding from Unicode to the
              destination encoding.

       --no-fallback
              Do not use the fallback mapping when transcoding from Unicode to
              the destination encoding. This is the default.

       -b, --block-size size
              Read input in blocks of size bytes at a time. The default block
              size is 4096.

       -f, --from-code encoding
              Set the original encoding of the data to encoding.

       -t, --to-code encoding
              Transcode the data to encoding.

       --add-signature
              Add a U+FEFF Unicode signature character (BOM) if the output
              charset supports it and the converter does not already add one
              itself.

       --remove-signature
              Remove a U+FEFF Unicode signature character (BOM).

       -o, --output file
              Write the transcoded data to file.
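       The signature options can be observed on a short input. This is a
       minimal sketch that assumes the us-ascii and utf-8 encoding names are
       available locally; the two-byte sample "hi" is made up for the
       illustration.

```shell
# Print the name of the default encoding (platform-dependent output).
uconv --default-code

# Convert ASCII to UTF-8 and ask for a leading signature (BOM).
# The dump should start with ef bb bf, the UTF-8 form of U+FEFF.
printf 'hi' | uconv -f us-ascii -t utf-8 --add-signature | od -An -tx1

# Feed UTF-8 that already starts with a BOM (\357\273\277 in octal)
# and strip it again; only the bytes for "hi" should remain.
printf '\357\273\277hi' | uconv -f utf-8 -t utf-8 --remove-signature | od -An -tx1
```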

CALLBACKS

       uconv supports specifying callbacks to handle invalid data. Callbacks
       can be set for both directions of transcoding: from the original
       encoding to Unicode, with the --from-callback option, and from Unicode
       to the destination encoding, with the --to-callback option.

       The following is a list of valid callback names, along with a
       description of their behavior. The list of callbacks actually supported
       by uconv is displayed when it is called with -h, --help.

       substitute       Write the encoding's substitute sequence, or the
                        Unicode replacement character U+FFFD when transcoding
                        to Unicode.

       skip             Ignore the invalid data.

       stop             Stop with an error when encountering invalid data.
                        This is the default callback.

       escape           Same as escape-icu.

       escape-icu       Replace the missing characters with a string of the
                        format %Uhhhh for plane 0 characters, and
                        %Uhhhh%Uhhhh for planes 1 and above characters, where
                        hhhh is the hexadecimal value of one of the UTF-16
                        code units representing the character. Characters
                        from planes 1 and above are written as a pair of
                        UTF-16 surrogate code units.

       escape-java      Replace the missing characters with a string of the
                        format \uhhhh for plane 0 characters, and
                        \uhhhh\uhhhh for planes 1 and above characters, where
                        hhhh is the hexadecimal value of one of the UTF-16
                        code units representing the character. Characters
                        from planes 1 and above are written as a pair of
                        UTF-16 surrogate code units.

       escape-c         Replace the missing characters with a string of the
                        format \uhhhh for plane 0 characters, and \Uhhhhhhhh
                        for planes 1 and above characters, where hhhh and
                        hhhhhhhh are the hexadecimal values of the Unicode
                        codepoint.

       escape-xml       Same as escape-xml-hex.

       escape-xml-hex   Replace the missing characters with a string of the
                        format &#xhhhh;, where hhhh is the hexadecimal value
                        of the Unicode codepoint.

       escape-xml-dec   Replace the missing characters with a string of the
                        format &#nnnn;, where nnnn is the decimal value of the
                        Unicode codepoint.

       escape-unicode   Replace the missing characters with a string of the
                        format {U+hhhh}, where hhhh is the hexadecimal value
                        of the Unicode codepoint. That hexadecimal string is
                        of variable length and can use from 4 to 6 digits.
                        This is the format universally used to denote a
                        Unicode codepoint in the literature, delimited by
                        curly braces for easy recognition of those
                        substitutions in the output.
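       To compare the callbacks listed above, one possible sketch runs the
       same input through several of them. The sample string is hypothetical:
       it contains é (U+00E9), which US-ASCII cannot represent, so each
       --to-callback value handles it differently. With skip the character
       should simply disappear, and with escape-xml-dec it should come out as
       &#233;.

```shell
# UTF-8 input containing é (U+00E9); \303\251 are the octal escapes
# for its UTF-8 bytes 0xC3 0xA9.
sample='caf\303\251\n'

# Run the same input through several --to-callback values and label
# each result with the callback that produced it.
for cb in skip escape-xml-dec escape-xml-hex escape-java escape-unicode; do
    printf '%-16s ' "$cb"
    printf "$sample" | uconv -f utf-8 -t us-ascii --to-callback "$cb"
done
```

       Only --to-callback is exercised here because the input is valid UTF-8;
       --from-callback would matter for malformed input bytes instead.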

EXAMPLES

       Convert data from a given encoding to the platform encoding:

           $ uconv -f encoding

       Check if a file contains valid data for a given encoding:

           $ uconv -f encoding -c file >/dev/null

       Convert a UTF-8 file to a given encoding and ensure that the resulting
       text is good for any version of HTML:

           $ uconv -f utf-8 -t encoding \
               --callback escape-xml-dec file

       Display the names of the Unicode code points in a UTF-8 file:

           $ uconv -f utf-8 -x any-name file

       Print the name of a Unicode code point whose value is known (U+30AB in
       this example):

           $ echo '\u30ab' | uconv -x 'hex-any; any-name'; echo
           {KATAKANA LETTER KA}{LINE FEED}
           $

       (The names are delimited by curly braces. The name of the line
       terminator is also displayed.)

       Normalize UTF-8 data using Unicode NFKC, remove all control characters,
       and map Katakana to Hiragana:

           $ uconv -f utf-8 -t utf-8 \
                 -x '::nfkc; [:Cc:] >; ::katakana-hiragana;'
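       A further rule-based transliteration, written in the same style as the
       last example, might strip accents from Latin text. This is a sketch
       under the assumption that the nfd and nfc transforms and the [:Mn:]
       set behave as in standard ICU transliteration rules; the sample word
       is hypothetical.

```shell
# Decompose (NFD), delete nonspacing marks ([:Mn:]), then recompose (NFC).
# Input is "café" in UTF-8; \303\251 are the octal bytes of é.
printf 'caf\303\251\n' |
    uconv -f utf-8 -t utf-8 -x '::nfd; [:Mn:] >; ::nfc;'
```

       After decomposition, é becomes e followed by a combining acute accent,
       so removing the nonspacing mark should leave plain "cafe".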

CAVEATS AND BUGS

       uconv reports errors as occurring at the first invalid byte
       encountered. This may be confusing to users of GNU iconv(1), which
       reports errors as occurring at the first byte of an invalid sequence.
       For multi-byte character sets or encodings, this means that uconv
       error positions may be at a later offset in the input stream than
       would be the case with GNU iconv(1).

       The reporting of error positions when a transliterator is used may be
       inaccurate or unavailable, in which case uconv will report the offset
       in the output stream at which the error occurred.
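       To see how uconv reacts to invalid input, a script can test the exit
       status and leave the diagnostic on standard error for the user. The
       sketch below assumes that uconv returns a non-zero status when the
       default stop callback aborts the conversion. The file name is
       hypothetical, and the wording of the message is not reproduced here
       because it may differ between ICU versions.

```shell
# Hypothetical sample: the byte 0xFF (octal \377) is never valid in UTF-8.
printf 'ab\377cd\n' > bad-utf8.txt

# With the default "stop" callback, uconv should abort at the invalid
# byte; its diagnostic (including the position) goes to standard error.
if uconv -f utf-8 -t utf-8 bad-utf8.txt > /dev/null; then
    echo "bad-utf8.txt: valid UTF-8"
else
    echo "bad-utf8.txt: invalid UTF-8 (uconv exit status $?)"
fi
```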

AUTHORS

       Jonas Utterstroem
       Yves Arrouye

VERSION

       60.2

       Copyright (C) 2000-2005 IBM, Inc. and others.

SEE ALSO

       iconv(1)

ICU MANPAGE                       2005-jul-1                          UCONV(1)