enca(1)                                                                enca(1)

NAME

       enca -- detect and convert encoding of text files

SYNOPSIS

       enca [-L LANGUAGE] [OPTION]... [FILE]...
       enconv [-L LANGUAGE] [OPTION]... [FILE]...

INTRODUCTION AND EXAMPLES

       If you are lucky, the only two things you will ever need to know are
       these: the command

              enca FILE

       tells you which encoding the file FILE uses (without changing it), and

              enconv FILE

       converts the file FILE to your locale's native encoding.  To convert
       the file to some other encoding, use the -x option (see the -x entry
       in section OPTIONS, and sections CONVERSION and ENCODINGS, for
       details).
25
       Both work with multiple files and with standard input (output) too.
       For example,

              enca -x latin2 <sometext | lpr

       ensures that the file `sometext' is in ISO Latin 2 when it is sent to
       the printer.

       The main reason these commands fail and turn your files into garbage
       is that Enca needs to know the language of the files to detect their
       encoding.  It tries to determine your language and preferred charset
       from locale settings, which might not be what you want.
36
       You can (or have to) use the -L option to tell it the right language.
       Suppose you downloaded a Russian HTML file, `file.htm', which claims
       to be windows-1251 but isn't.  So you run

              enca -L ru file.htm

       and find out it's KOI8-R (for example).  Be warned that currently not
       many languages are supported (see section LANGUAGES).
45
       Another warning: several of Enca's features, namely its charset
       conversion capabilities, depend strongly on what other tools are
       installed on your system (see section CONVERSION).  Run

              enca --version

       to get the list of features (see section FEATURES).  Also try

              enca --help

       to get a description of all other Enca options (and to find the rest
       of this manual page redundant).
58

DESCRIPTION

       Enca reads the given text files, or standard input when none are
       given, and uses knowledge about their language (which you must supply)
       and a mixture of parsing, statistical analysis, guessing and black
       magic to determine their encodings, which it then prints to standard
       output (or it confesses it doesn't have any idea what the encoding
       could be).  By default, Enca presents results as a multiline
       human-readable description; several other formats are available, see
       Output type selectors below.
68
69       Enca can also convert files to some other encoding ENC when you ask for
70       it--either  using  a built-in converter, some conversion library, or by
71       calling an external converter.
72
       Enca's primary goal is to be usable unattended, as an automatic
       conversion tool, though it has perhaps not reached this point yet
       (please see section SECURITY).
76
       Please note that, except in rare cases, Enca really has to know the
       language of the input files to give you a reliable answer.  On the
       other hand, it can then cope quite well with files that are not purely
       textual, or even detect the charset of text strings inside a binary
       file; of course, this depends on the character of the non-text
       component.
82
       Enca doesn't care about the structure of input files; it views them as
       a uniform piece of text/data.  For multipart files (e.g. mailboxes),
       you have to use some tool that knows the structure to extract the
       individual parts first.  This is the cost of being able to detect
       encodings of damaged, incomplete or otherwise incorrect files.
88

OPTIONS

90       There are several categories of options: operation mode options, output
91       type  selectors,  guessing  parameters,  conversion parameters, general
92       options and listings.
93
       All long options can be abbreviated as long as they are unambiguous.
       Mandatory parameters of long options are mandatory for short options
       too.
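
       The abbreviation rule can be illustrated with a short Python sketch.
       It is not Enca's own option parser: the function name is hypothetical,
       and the rule that an exact match always wins is an assumption
       (borrowed from how getopt_long(3) usually behaves).  The option names
       are the long options listed in this manual page.

```python
def resolve_abbreviation(word, candidates):
    """Return the unique candidate that `word` abbreviates.

    An exact match always wins (assumption); otherwise `word` must be a
    prefix of exactly one candidate.  Raises ValueError when nothing
    matches or when the abbreviation is ambiguous.
    """
    if word in candidates:
        return word
    matches = [name for name in candidates if name.startswith(word)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown option: --{word}")
    raise ValueError(f"ambiguous option --{word}: {', '.join(sorted(matches))}")


# Long option names taken from this manual page.
LONG_OPTIONS = [
    "auto-convert", "guess", "details", "enca-name", "human-readable",
    "iconv-name", "rfc1345-name", "mime-name", "cstocs-name", "name",
    "convert-to", "language", "try-converters",
    "external-converter-program", "with-filename", "no-filename",
    "verbose", "help", "license", "list", "version",
]
```

       With this list, `lang' resolves to language, while `li' is rejected
       because it could mean either license or list.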
97
98   Operation modes
99       are following:
100
101       -c, --auto-convert
102              Equivalent to calling Enca as enconv.
103
104              If no output type selector is specified, detect file  encodings,
105              guess  your preferred charset from locales, and convert files to
106              it (only available with +target-charset-auto feature).
107
108       -g, --guess
109              Equivalent to calling Enca as enca.
110
111              If no output type selector is specified, detect  file  encodings
112              and report them.
113
114   Output type selectors
115       select what action Enca will take when it determines the encoding; most
116       of them just choose between different names,  formats  and  conventions
117       how encodings can be printed, but one of them (-x) is special: it tells
118       Enca to recode files to some other encoding  ENC.   These  options  are
119       mutually  exclusive;  if you specify more than one output type selector
120       the last one takes precedence.
121
122       Several output types represent charset name used by some other program,
123       but not all these programs know all the charsets which Enca recognises.
124       Be warned, Enca makes no difference between  unrecognised  charset  and
125       charset having no name in given namespace in such situations.
126
127       -d, --details
              It used to print a few pages of details about the guessing
              process, but since Enca is now just a program linked against
              the Enca library, this is no longer possible and this option is
              roughly equivalent to --human-readable, except that it reports
              the failure reason when Enca doesn't recognize the encoding.
133
134       -e, --enca-name
135              Prints  Enca's  nice name of the charset, i.e., perhaps the most
136              generally accepted and more or less human-readable charset iden‐
137              tifier, with surfaces appended.
138
139              This name is used when calling an external converter, too.
140
141       -f, --human-readable
142              Prints  verbal  description  of  the  detected  charset and sur‐
143              faces--something a human understands best.  This is the  default
144              behaviour.
145
              The precise format is as follows: the first line contains the
              charset name alone, followed by zero or more indented lines
              containing names of detected surfaces.  This format is not,
              however, suitable or intended for further machine processing,
              and the verbal charset descriptions are likely to change in the
              future.
152
153       -i, --iconv-name
154              Prints  how  iconv(3)  (and/or  iconv(1))  calls  the   detected
155              charset.   More precisely, it prints one, more or less arbitrar‐
156              ily chosen, alias accepted by iconv.  A charset unknown to iconv
157              counts as unknown.
158
159              This  output  type  makes  sense only when Enca is compiled with
160              iconv support (feature +iconv-interface).
161
162       -r, --rfc1345-name
              Prints RFC 1345 charset name.  When such a name doesn't exist
              because RFC 1345 doesn't define a given encoding, some other
              name defined in some other RFC, or just the name which the
              author considers `the most canonical', is printed.
167
168              Since  RFC  1345  doesn't  define  surfaces,  no surface info is
169              appended.
170
171       -m, --mime-name
172              Prints preferred MIME name of detected  charset.   This  is  the
173              name you should normally use when fixing e-mails or web pages.
174
175              A charset not present in http://www.iana.org/assignments/charac
176              ter-sets counts as unknown.
177
178       -s, --cstocs-name
179              Prints how cstocs(1) calls  the  detected  charset.   A  charset
180              unknown to cstocs counts as unknown.
181
182       -n, --name=WORD
              Prints charset (encoding) name selected by WORD (can be
              abbreviated as long as it is unambiguous).  For names listed
              above, --name=WORD is equivalent to --WORD.
186
187              Using  aliases  as  the output type causes Enca to print list of
188              all accepted aliases of detected charset.
189
190       -x, --convert-to=[..]ENC
191              Converts file to encoding ENC.
192
              The optional `..' before the encoding name has no special
              meaning, except that you can use it to remind yourself that,
              unlike in recode(1), you should specify the desired encoding
              rather than the current one.
197
198              You  can  use  recode(1)  recoding  chains  or any other kind of
199              braindead recoding specification for ENC, provided that you tell
200              Enca  to use some tool understanding it for conversion (see sec‐
201              tion CONVERSION).
202
              When Enca fails to determine the encoding, it prints a warning
              and leaves the file as is; when it is run as a filter it tries
              to do its best to copy standard input to standard output
              unchanged.  Nevertheless, you should not rely on it and should
              make a backup.
208
209   Guessing parameters
210       There's only one: -L setting language of input files.  This  option  is
211       mandatory (but see below).
212
213       -L, --language=LANG
214              Sets language of input files to LANG.
215
216              More precisely, LANG can be any valid locale name (or alias with
217              +locale-alias feature) of some supported language.  You can also
218              specify  `none'  as  language name, only multibyte encodings are
219              recognised then.  Run
220
221              enca --list languages
222
223              to get list of supported languages.  When you don't specify  any
224              language  Enca tries to guess your language from locale settings
225              and assumes input files use this  language.   See  section  LAN‐
226              GUAGES for details.
227
228   Conversion parameters
229       give  you  finer  control  of how charset conversion will be performed.
230       They don't affect anything when -x is not  specified  as  output  type.
231       Please see section CONVERSION for the gory conversion details.
232
233       -C, --try-converters=LIST
234              Appends comma separated LIST to the list of converters that will
235              be tried when you ask for conversion.  Their names can be abbre‐
236              viated as long as they are unambiguous.  Run
237
238              enca --list converters
239
240              to  get  list of all valid converter names (and see section CON‐
241              VERSION for their description).
242
              The default list depends on how Enca has been compiled; run
244
245              enca --help
246
247              to find out default converter list.
248
              Note that the default list is used only when you don't specify
              -C at all.  Otherwise, the list is built as if it were
              initially empty and every -C adds new converter(s) to it.
              Moreover, specifying none as a converter name clears the
              converter list.
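
              The rules above can be modelled in a few lines of Python.  This
              is only a sketch of the described behaviour, not Enca's code:
              the function name is hypothetical, and abbreviated converter
              names are deliberately not handled.

```python
def build_converter_list(default, c_options):
    """Combine the compiled-in default list with the -C options given.

    default   -- list of converter names used when no -C is given
    c_options -- the comma separated LIST argument of each -C, in order
    """
    if not c_options:
        return list(default)             # no -C at all: use the default
    converters = []                      # otherwise start from empty
    for option in c_options:
        for name in option.split(","):
            name = name.strip()
            if name == "none":
                converters.clear()       # `none' clears what was collected
            elif name:
                converters.append(name)
    return converters
```

              For instance, under these rules `-C built-in,librecode -C none
              -C extern' would leave only extern in the list.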
253
254       -E, --external-converter-program=PATH
              Sets external converter program name to PATH.  The default
              external converter depends on how enca has been compiled, and
              the possibility to use external converters may not be available
              at all.  Run
259
260              enca --help
261
262              to find out default converter program in your enca build.
263
264   General options
265       don't fit to other option categories...
266
267       -p, --with-filename
268              Forces Enca to prefix each result with corresponding file  name.
269              By  default,  Enca  prefixes  results with filenames when run on
270              multiple files.
271
272              Standard input is printed as STDIN and standard output as STDOUT
273              (the latter can be probably seen in error messages only).
274
275       -P, --no-filename
276              Forces  Enca to not prefix results with file names.  By default,
277              Enca doesn't prefix result with file name when run on  a  single
278              file (including standard input).
279
280       -V, --verbose
281              Increases verbosity level (each use increases it by one).
282
              Currently this option is not very useful because different parts
              of Enca respond differently to the same verbosity level, mostly
              not at all.
286
287   Listings
       are all terminal, i.e. when Enca encounters one of them it prints the
       required listing and terminates without processing any following
       options.
291
292       -h, --help
293              Prints brief usage help.
294
295       -G, --license
296              Prints full Enca license (through a pager, if possible).
297
298       -l, --list=WORD
299              Prints  list specified by WORD (can be abbreviated as long as it
300              is unambiguous).  Available lists include:
301
302              built-in-charsets.  All encodings convertible by  built-in  con‐
303              verter,  by  group  (both input and output encoding must be from
304              this list and belong to the same group for internal conversion).
305
306              built-in-encodings.  Equivalent to built-in-charsets,  but  con‐
307              sidered obsolete; will be accepted with a warning, for a while.
308
309              converters.  All valid converter names (to be used with -C).
310
311              charsets.   All encodings (charsets).  You can select what names
312              will be printed with --name or any name output type selector (of
313              course,  only encodings having a name in given namespace will be
314              printed then), the selector must be specified before --list.
315
316              encodings.  Equivalent to  charsets,  but  considered  obsolete;
317              will be accepted with a warning, for a while.
318
319              languages.   All  supported  languages  together  with  charsets
320              belonging to them.   Note  output  type  selects  language  name
321              style, not charset name style here.
322
323              names.  All possible values of --name option.
324
325              lists.  All possible values of this option.  (Crazy?)
326
327              surfaces.  All surfaces Enca recognises.
328
329       -v, --version
330              Prints  program  version  and list of features (see section FEA‐
331              TURES).
332

CONVERSION

       Though Enca was originally designed as a tool for guessing encodings
       only, it now features several methods of charset conversion.  You can
       control which of them will be used with -C.
337
       Enca sequentially tries converters from the list specified by -C until
       it finds one that is able to perform the required conversion or until
       it exhausts the list.  You should specify preferred converters first,
       less preferred ones later.  The external converter (extern) should
       always be specified last, only as a last resort, since it's usually
       not possible to recover when it fails.  The default list of converters
       always starts with built-in and then continues with the first one
       available from: librecode, iconv, nothing.
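
       The sequential fallback can be pictured as a simple loop.  The Python
       sketch below is an illustration only: the two converter functions are
       hypothetical stand-ins that merely pretend to convert, and the charset
       group used by the fake built-in converter is made up for the example.

```python
def convert_with_first_able(converters, source, target):
    """Try converters in order; return the name of the first that succeeds.

    converters -- list of (name, function) pairs; each function takes
                  (source, target) and returns True when it performed the
                  conversion, False when it cannot handle this pair
    Returns None when the list is exhausted.
    """
    for name, attempt in converters:
        if attempt(source, target):
            return name
    return None


# Hypothetical stand-ins for real converters, for illustration only.
def fake_builtin(source, target):
    same_group = {"ISO-8859-2", "CP1250", "IBM852"}
    return source in same_group and target in same_group


def fake_iconv(source, target):
    return True
```

       Listing the cheap, safe converter first means the more general one is
       consulted only when the first one declines the conversion.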
346
       Note that when Enca says it is not able to perform the conversion, it
       only means none of the converters is able to perform it.  It may still
       be possible to perform the required conversion in several steps, using
       several converters, but figuring out how probably requires human
       intelligence.
352
353   Built-in converter
       is the simplest and by far the fastest of all.  It can perform only a
       few byte-to-byte conversions and modifies files directly in place
       (which may be considered dangerous, but is pretty efficient).  You can
       get the list of all encodings it can convert with
358
359              enca --list built-in
360
361       Beside speed, its main advantage (and also  disadvantage)  is  that  it
362       doesn't  care: it simply converts characters having a representation in
363       target encoding, doesn't touch anything else and never prints any error
364       message.
365
366       This converter can be specified as built-in with -C.
367
368   Librecode converter
369       is  an  interface  to GNU recode library, that does the actual recoding
370       job.  It may or may not be compiled in; run
371
372              enca --version
373
374       to find out its  availability  in  your  enca  build  (feature  +libre‐
375       code-interface).
376
       You should be familiar with recode(1) before using it, since recode is
       a quite sophisticated and powerful charset conversion tool.  You may
       run into problems using it together with Enca, particularly because
       Enca's support for surfaces is not 100% compatible, because recode
       tries too hard to make the transformation reversible, because it
       sometimes silently ignores I/O errors, and because it's incredibly
       buggy.  Please see the GNU recode info pages for details about the
       recode library.
384
385       This converter can be specified as librecode with -C.
386
387   Iconv converter
388       is  an  interface  to the UNIX98 iconv(3) conversion functions, that do
389       the actual recoding job.  It may or may not be compiled in; run
390
391              enca --version
392
393       to find out its availability in your enca build (feature  +iconv-inter‐
394       face).
395
       While iconv is present on most systems today, it only rarely offers a
       useful set of available conversions, the only notable exception being
       iconv from GNU libc.  It is usually quite picky about surfaces, too
       (while, at the same time, not implementing surface conversion).
       However, it probably represents the only standard(ized) tool able to
       perform conversion from/to Unicode.  Please see the iconv
       documentation for details about its capabilities on your particular
       system.
403
404       This converter can be specified as iconv with -C.
405
406   External converter
       is an arbitrary external conversion tool that can be specified with
       the -E option (at most one can be defined at a time).  Some standard
       ones are provided together with enca: cstocs, recode, map, umap, and
       piconv.  All are wrapper scripts for cstocs(1), recode(1), map(1),
       umap(1), and piconv(1), respectively.
412
       Please note that enca has little control over what the external
       converter really does.  If you set it to /bin/rm, you are fully
       responsible for the consequences.
416
417       If  you  want  to  make your own converter to use with enca, you should
418       know it is always called
419
420              CONVERTER ENC_CURRENT ENC FILE [-]
421
422       where CONVERTER is what has been set by  -E,  ENC_CURRENT  is  detected
423       encoding,  ENC is what has been specified with -x, and FILE is the file
424       to convert, i.e. it is called for each file separately.   The  optional
425       fourth parameter, -, should cause (when present) sending result of con‐
426       version to standard output instead of overwriting the file  FILE.   The
427       converter  should  also  take  care  of  not changing file permissions,
428       returning error code 1 when it fails and cleaning its temporary  files.
429       Please see the standard external converters for examples.
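
       As an illustration of this calling convention, here is a minimal
       sketch of a converter written in Python.  It is not one of the
       standard wrappers shipped with enca.  It assumes that both encoding
       names are understood by Python's codec registry, which is not true of
       every Enca name (names with surfaces appended, for example); the
       function names are hypothetical.

```python
import os
import sys
import tempfile


def convert(enc_current, enc, path, to_stdout=False):
    """Recode `path` from enc_current to enc; return 0 on success, 1 on failure."""
    tmp_name = None
    try:
        with open(path, "rb") as src:
            data = src.read().decode(enc_current).encode(enc)
        if to_stdout:
            sys.stdout.buffer.write(data)
            sys.stdout.buffer.flush()
            return 0
        # Write to a temporary file in the same directory, then replace the
        # original, so a failure half-way never leaves a truncated file.
        mode = os.stat(path).st_mode
        fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
        with os.fdopen(fd, "wb") as dst:
            dst.write(data)
        os.chmod(tmp_name, mode & 0o7777)   # keep the original permissions
        os.replace(tmp_name, path)
        tmp_name = None
        return 0
    except (OSError, LookupError, UnicodeError):
        return 1                            # error code 1 on any failure
    finally:
        if tmp_name is not None and os.path.exists(tmp_name):
            os.remove(tmp_name)             # clean up the temporary file


def main(argv):
    # Called as: CONVERTER ENC_CURRENT ENC FILE [-]
    if len(argv) not in (4, 5) or (len(argv) == 5 and argv[4] != "-"):
        return 1
    return convert(argv[1], argv[2], argv[3], to_stdout=len(argv) == 5)

# A real script would end with: sys.exit(main(sys.argv))
```

       Unlike the built-in converter, this sketch fails (and leaves the file
       untouched) as soon as one character cannot be represented in the
       target encoding.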
430
431       This converter can be specified as extern with -C.
432
433   Default target charset
434       The  straightforward way of specifying target charset is the -x option,
435       which overrides any defaults.  When Enca is called as  enconv,  default
436       target charset is selected exactly the same way as recode(1) does it.
437
438       If  the  DEFAULT_CHARSET  environment variable is set, it's used as the
439       target charset.
440
       Otherwise, if your system provides the nl_langinfo(3) function, the
       current locale's native charset is used as the target charset.
443
444       When both methods fail, Enca complains and terminates.
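
       The lookup order described above can be sketched in Python as
       follows.  This mirrors the description only, not Enca's source; the
       function name is hypothetical.  Note that the environ parameter
       affects only the DEFAULT_CHARSET lookup, while the locale query
       always uses the locale of the running process.

```python
import locale
import os


def default_target_charset(environ=None):
    """Return the default target charset, or None when it cannot be found."""
    if environ is None:
        environ = os.environ
    charset = environ.get("DEFAULT_CHARSET")
    if charset:
        return charset                      # first choice: DEFAULT_CHARSET
    # nl_langinfo() is not available on every platform.
    if hasattr(locale, "nl_langinfo"):
        try:
            locale.setlocale(locale.LC_CTYPE, "")   # adopt the user's locale
        except locale.Error:
            return None
        return locale.nl_langinfo(locale.CODESET) or None
    return None                             # both methods failed
```

       A caller would treat a None result the way Enca treats failure of
       both methods: complain and stop.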
445
446   Reversibility notes
       If reversibility is crucial for you, you shouldn't use enca as a
       converter at all (or maybe you can, with a very specifically designed
       recode(1) wrapper).  Otherwise you should at least know that there are
       four basic means of handling inconvertible character entities:
451
452       fail--this is a possibility, too, and incidentally  it's  exactly  what
453       current  GNU libc iconv implementation does (recode can be also told to
454       do it)
455
456       don't touch them--this is what enca internal converter always does  and
457       recode  can  do;  though it is not reversible, a human being is usually
458       able to reconstruct the original (at least in principle)
459
460       approximate them--this is what cstocs can do, and  recode  too,  though
461       differently;  and the best choice if you just want to make the accursed
462       text readable
463
464       drop them out--this is what both recode and cstocs can do  (cstocs  can
465       also  replace  these characters by some fixed character instead of mere
466       ignoring); useful when the to-be-omitted characters contain only noise.
467
       Please consult your favourite converter's manual for details of this
       issue.  Generally, if you are not lucky enough to have only
       convertible characters in your file, manual intervention is needed
       anyway.
471
472   Performance notes
       Poor performance of available converters has been one of the main
       reasons for including the built-in converter in enca.  Try to use it
       whenever possible, i.e. when the files in question are charset-clean
       enough or charset-messy enough that its zero built-in intelligence
       doesn't matter.  It requires no extra disk space or extra memory and
       can outperform recode(1) by more than a factor of 10 on large files
       and the Perl version (i.e. the faster one) of cstocs(1) by more than a
       factor of 400 on small files (in fact it's almost as fast as mere
       cp(1)).
481
       Try to avoid external converters when they are not absolutely
       necessary, since all the forking and moving stuff around is incredibly
       slow.
484

ENCODINGS

486       You can get list of recognised character sets with
487
488              enca --list charsets
489
490       and using --name parameter you can select any name you want to be  used
491       in the listing.  You can also list all surfaces with
492
493              enca --list surfaces
494
495       Encoding  and  surface  names are case insensitive and non-alphanumeric
496       characters are not taken into account.  However, non-alphanumeric char‐
497       acters  are mostly not allowed at all.  The only allowed are: `-', `_',
498       `.', `:', and `/' (as  charset/surface  separator).   So  `ibm852'  and
499       `IBM-852' are the same, while `IBM 852' is not accepted.
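
       These naming rules can be expressed as a small Python sketch.  It is
       an illustration, not Enca's name lookup: the function names are
       hypothetical, and it assumes that every `/' in a name introduces one
       surface.

```python
import re

# Alphanumerics plus the allowed punctuation; `/' is handled separately
# because it separates the charset from its surfaces.
ALLOWED = re.compile(r"[A-Za-z0-9\-_.:]+")


def normalize_part(part):
    """Lower-case one name and drop its non-alphanumeric characters."""
    if not ALLOWED.fullmatch(part):
        raise ValueError(f"invalid name: {part!r}")
    key = "".join(ch for ch in part.lower() if ch.isalnum())
    if not key:
        raise ValueError(f"invalid name: {part!r}")
    return key


def normalize_encoding(name):
    """Return (charset_key, [surface_keys]) suitable for comparison."""
    charset, *surfaces = name.split("/")
    return normalize_part(charset), [normalize_part(s) for s in surfaces]
```

       Two names denote the same encoding when their normalized forms are
       equal, so `ibm852' and `IBM-852' compare equal, while `IBM 852' is
       rejected because of the space.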
500
501   Charsets
502       Following list of recognised charsets uses Enca's names (-e) and verbal
503       descriptions as reported by Enca (-f):
504
505       ASCII         7bit ASCII characters
506       ISO-8859-2    ISO 8859-2 standard; ISO Latin 2
507       ISO-8859-4    ISO 8859-4 standard; Latin 4
508       ISO-8859-5    ISO 8859-5 standard; ISO Cyrillic
509       ISO-8859-13   ISO 8859-13 standard; ISO Baltic; Latin 7
510       ISO-8859-16   ISO 8859-16 standard
511       CP1125        MS-Windows code page 1125
512       CP1250        MS-Windows code page 1250
513       CP1251        MS-Windows code page 1251
514       CP1257        MS-Windows code page 1257; WinBaltRim
515       IBM852        IBM/MS code page 852; PC (DOS) Latin 2
516       IBM855        IBM/MS code page 855
517       IBM775        IBM/MS code page 775
518       IBM866        IBM/MS code page 866
519       baltic        ISO-IR-179; Baltic
520       KEYBCS2       Kamenicky encoding; KEYBCS2
521       macce         Macintosh Central European
522       maccyr        Macintosh Cyrillic
523       ECMA-113      Ecma Cyrillic; ECMA-113
524       KOI-8_CS_2    KOI8-CS2 code (`T602')
525       KOI8-R        KOI8-R Cyrillic
526       KOI8-U        KOI8-U Cyrillic
527       KOI8-UNI      KOI8-Unified Cyrillic
528       TeX           (La)TeX control sequences
529       UCS-2         Universal character set 2 bytes; UCS-2; BMP
530       UCS-4         Universal character set 4 bytes; UCS-4; ISO-10646
531       UTF-7         Universal transformation format 7 bits; UTF-7
532       UTF-8         Universal transformation format 8 bits; UTF-8
533       CORK          Cork encoding; T1
534       GBK           Simplified Chinese National Standard; GB2312
535       BIG5          Traditional Chinese Industrial Standard; Big5
536       HZ            HZ encoded GB2312
537       unknown       Unrecognized encoding
538
       where unknown is not a real encoding; it is reported when Enca is not
       able to give a reliable answer.
541
542   Surfaces
543       Enca  has some experimental support for so-called surfaces (see below).
544       It detects following surfaces (not all can be applied to all charsets):
545
546       /CR     CR line terminators
547       /LF     LF line terminators
548       /CRLF   CRLF line terminators
549       N.A.    Mixed line terminators
550       N.A.    Surrounded by/intermixed with non-text data
551       /21     Byte order reversed in pairs (1,2 -> 2,1)
552       /4321   Byte order reversed in quadruples (1,2,3,4 -> 4,3,2,1)
553       N.A.    Both little and big endian chunks, concatenated
554       /qp     Quoted-printable encoded
555
556       Note some surfaces have N.A. in place  of  identifier--they  cannot  be
557       specified  on command line, they can only be reported by Enca.  This is
558       intentional because they only inform you why the file cannot be consid‐
559       ered surface-consistent instead of representing a real surface.
560
561       Each charset has its natural surface (called `implied' in recode) which
562       is not reported, e.g., for IBM 852 charset  it's  `CRLF  line  termina‐
563       tors'.  For UCS encodings, big endian is considered as natural surface;
564       unusual byte orders are constructed from 21 and 4321 permutations: 2143
565       is reported simply as 21, while 3412 is reported as combination of 4321
566       and 21.
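
       The way the two permutations combine can be checked with a short
       Python sketch.  The function names are hypothetical; the code only
       shuffles bytes and knows nothing about charsets.

```python
def apply_21(data):
    """Reverse byte order in pairs: (1,2) -> (2,1)."""
    if len(data) % 2:
        raise ValueError("length must be a multiple of 2")
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


def apply_4321(data):
    """Reverse byte order in quadruples: (1,2,3,4) -> (4,3,2,1)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    return b"".join(data[i:i + 4][::-1] for i in range(0, len(data), 4))
```

       Applying apply_21 to the bytes 1,2,3,4 gives the order 2143, and
       applying apply_4321 followed by apply_21 gives 3412, which is why the
       latter order is reported as a combination of both surfaces.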
567
568       Doubly-encoded  UTF-8  is  neither  charset  nor  surface,  it's   just
569       reported.
570
571   About charsets, encodings and surfaces
572       Charset  is a set of character entities while encoding is its represen‐
573       tation in the terms of bytes and bits.   In  Enca,  the  word  encoding
574       means  the  same as `representation of text', i.e. the relation between
575       sequence of character entities constituting the text  and  sequence  of
576       bytes (bits) constituting the file.
577
578       So, encoding is both character set and so-called surface (line termina‐
579       tors, byte order, combining, Base64 transformation,  etc.).   Neverthe‐
580       less, it proves convenient to work with some {charset,surface} pairs as
581       with genuine charsets.  So, as in recode(1), all UCS- and  UTF-  encod‐
582       ings of Universal character set are called charsets.  Please see recode
583       documentation for more details of this issue.
584
       The only good thing about surfaces is this: if you don't start playing
       with them, Enca won't either, and it will try to behave as much as
       possible like a surface-unaware program, even when talking to recode.
588

LANGUAGES

590       Enca needs to know the language of input files  to  work  reliably,  at
591       least  in case of regular 8bit encoding.  Multibyte encodings should be
592       recognised for any Latin, Cyrillic or Greek language.
593
       You can (or have to) use the -L option to tell Enca the language.
       Since people most often work with files in the same language for which
       they have configured locales, Enca tries to guess the language by
       examining the value of LC_CTYPE and other locale categories (please
       see locale(7)) and uses it as the language when you don't specify any.
       Of course, the guess may be completely wrong, in which case Enca will
       give you nonsense answers and damage your files, so please don't
       forget to use the -L option.  You can also use the ENCAOPT environment
       variable to set a default language (see section ENVIRONMENT).
603
604       Following languages are supported by  Enca  (each  language  is  listed
605       together with supported 8bit encodings).
606
       Belarusian    CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855
       Bulgarian     CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
       Czech         ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
       Estonian      ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
       Croatian      CP1250 ISO-8859-2 IBM852 macce CORK
       Hungarian     ISO-8859-2 CP1250 IBM852 macce CORK
       Lithuanian    CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
       Latvian       CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
       Polish        ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
       Russian       KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
       Slovak        CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
       Slovene       ISO-8859-2 CP1250 IBM852 macce CORK
       Ukrainian     CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
       Chinese       GBK BIG5 HZ
       none
623
624       The  special  language none can be shortened to __, it contains no 8bit
625       encodings, so only multibyte encodings are detected.
626
627       You can also use locale names instead of languages:
628
629       Belarusian      be
630       Bulgarian       bg
631       Czech           cs
632       Estonian        et
633       Croatian        hr
634       Hungarian       hu
635       Lithuanian      lt
636       Latvian         lv
637       Polish          pl
638       Russian         ru
639       Slovak          sk
640       Slovene         sl
641       Ukrainian       uk
642       Chinese         zh
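
       How a language might be guessed from locale settings can be sketched
       in Python.  This is a simplification and not Enca's algorithm: it
       looks only at LC_ALL, LC_CTYPE and LANG (in the usual POSIX
       precedence), ignores locale aliases, and the function name is
       hypothetical.  The set of codes is the list of locale names above.

```python
SUPPORTED = {"be", "bg", "cs", "et", "hr", "hu", "lt", "lv",
             "pl", "ru", "sk", "sl", "uk", "zh"}


def guess_language(environ):
    """Guess a two-letter language code from locale variables, or None."""
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(variable)
        if value:
            break
    else:
        return None
    # Strip territory, codeset and modifier: "ru_RU.KOI8-R" -> "ru".
    language = value.split("_")[0].split(".")[0].split("@")[0].lower()
    return language if language in SUPPORTED else None
```

       A result of None corresponds to the case where the locale gives no
       usable hint and the -L option has to be given explicitly.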
643

FEATURES

       Several of Enca's features depend on what is available on your system
       and how it was compiled.  You can get their list with
647
648              enca --version
649
650       A plus sign before a feature name means it is available; a minus sign
651       means this build lacks that feature.
652
653       librecode-interface.  Enca has an interface to the GNU recode library
654       charset conversion functions.
655
656       iconv-interface.  Enca has an interface to the UNIX98 iconv charset
657       conversion functions.
658
659       external-converter.  Enca can use external conversion programs (if you
660       have a suitable one installed).
661
662       language-detection.  Enca tries to guess the language (-L) from
663       locales.  You don't need the --language option, at least in principle.
664
665       locale-alias.  Enca is able to resolve locale aliases used for language
666       names.
667
668       target-charset-auto.  Enca tries to detect your preferred charset from
669       locales.  The --auto-convert option and calling Enca as enconv work, at
670       least in principle.
671
672       ENCAOPT.   Enca  is  able  to correctly parse this environment variable
673       before command line parameters.  Simple stuff like ENCAOPT="-L uk" will
674       work even without this feature.
675
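       A script can test for a feature by looking for its name with a leading
       plus sign.  The sketch below is not part of Enca: the function name
       enca_has_feature is hypothetical, and it assumes the feature names in
       the --version output are separated by whitespace.

```shell
# Succeed if the text on standard input contains the word "+NAME",
# i.e. the feature NAME is marked as available.
# Assumption: feature names are separated by spaces, tabs or newlines.
enca_has_feature() {
    # Put every word on its own line, then look for an exact match.
    tr -s ' \t' '\n\n' | grep -Fqx -- "+$1"
}
```

       It could be used as: enca --version | enca_has_feature iconv-interface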

ENVIRONMENT

677       The variable ENCAOPT can hold a set of default Enca options.  Its content
678       is interpreted before command line arguments.  Unfortunately, this
679       doesn't work everywhere (it requires the +ENCAOPT feature).
680
681       LC_CTYPE, LC_COLLATE and LC_MESSAGES (possibly inherited from LC_ALL or
682       LANG) are used for guessing your language (this requires the
683       +language-detection feature).
684
685       The  variable DEFAULT_CHARSET can be used by enconv as the default tar‐
686       get charset.
687
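       To see roughly what language the locale settings imply, the usual POSIX
       precedence for a single category can be reproduced in the shell.  This
       is only an illustration: the function name locale_language is
       hypothetical, only LC_CTYPE is consulted, and whether Enca examines the
       variables in exactly this order is an assumption.

```shell
# Print the language part of the effective LC_CTYPE locale name,
# e.g. "cs" for "cs_CZ.UTF-8".  Fail for empty, "C" or "POSIX" locales.
# Precedence assumed here: LC_ALL, then LC_CTYPE, then LANG.
locale_language() {
    loc=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
    # Strip territory, codeset and modifier: ll_CC.codeset@modifier -> ll
    loc=${loc%%[_.@]*}
    case $loc in
        ''|C|POSIX) return 1 ;;
        *) printf '%s\n' "$loc" ;;
    esac
}
```

       It could be used as: lang=$(locale_language) && enca -L "$lang" FILE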

DIAGNOSTICS

689       Enca returns exit code 0 when all input files were successfully
690       processed (i.e. all encodings were detected and all files were converted
691       to the required encoding, if conversion was asked for).  Exit code 1 is
692       returned when Enca wasn't able to either guess the encoding or perform
693       the conversion on any input file because it's not clever enough.  Exit
694       code 2 is returned in case of serious (e.g. I/O) trouble.
695
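       In a script, the three exit codes described above can be told apart
       with a case statement.  A minimal sketch; the function name
       describe_enca_status and the message texts are hypothetical.

```shell
# Translate an Enca exit code (as documented above) into a short message.
describe_enca_status() {
    case $1 in
        0) echo "ok: all files processed" ;;
        1) echo "failed: encoding not detected or conversion not possible" ;;
        2) echo "error: serious trouble (e.g. I/O)" ;;
        *) echo "unexpected exit code $1" ;;
    esac
}
```

       It could be used as: enca -L ru file.htm; describe_enca_status $?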

SECURITY

697       It should be possible to let Enca work unattended; that is its goal.
698       However:
699
700       There's no guarantee that detection works every time.  Don't bet on it;
701       you can easily lose valuable data.
702
703       Don't use enca (the program); link to libenca instead if you want
704       anything resembling security.  You then have to perform the eventual
705       conversion yourself.
706
707       Don't use external converters.  Ideally, disable them at compile time.
708
709       Be aware of ENCAOPT and all the built-in automagic that guesses various
710       things from the environment, namely locales.
711
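       When conversions do run unattended, a wrapper can reduce the risk of
       data loss and of surprises from ENCAOPT.  The following is a sketch
       only, assuming a POSIX shell; the function name convert_with_backup
       and the .orig suffix are hypothetical and not part of Enca.

```shell
# Copy FILE to FILE.orig, then run COMMAND [ARG]... FILE with ENCAOPT
# removed from the environment.
# Usage: convert_with_backup FILE COMMAND [ARG]...
convert_with_backup() {
    file=$1
    shift
    # Refuse to overwrite an existing backup.
    if [ -e "$file.orig" ]; then
        echo "backup $file.orig already exists" >&2
        return 2
    fi
    cp -p -- "$file" "$file.orig" || return 2
    # The subshell keeps the unset from leaking into the caller.
    ( unset ENCAOPT; "$@" "$file" )
}
```

       It could be used as: convert_with_backup file.htm enconv -L ru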

SEE ALSO

713       autoconvert(1), cstocs(1), file(1), iconv(1), iconv(3), nl_langinfo(3),
714       map(1),  piconv(1),  recode(1),  locale(5), locale(7), ltt(1), umap(1),
715       unicode(7), utf-8(7), xcode(1)
716

KNOWN BUGS

718       It has too many unknown bugs.
719
720       The idea of using LC_* values for the language is certainly braindead.
721       However, I like it.
722
723       It can't back up files before mangling them.
724
725       In certain situations, it may behave incorrectly on >31bit file systems
726       and/or over NFS (both untested but shouldn't cause  problems  in  prac‐
727       tice).
728
729       The built-in converter does not convert the character `ch' from
730       KOI8-CS2, and possibly some other characters you've probably never heard
731       about anyway.
732
733       EOL  type  recognition  works poorly on Quoted-printable encoded files.
734       This should be fixed someday.
735
736       There are no command line options to tune libenca parameters.  This is
737       intentional (Enca should DWIM), but it is sometimes a nuisance.
738
739       The manual page is too long, especially this section.  This doesn't
740       matter since nobody reads it.
741
742       Send bug reports to <https://github.com/nijel/enca/issues>.
743

TRIVIA

745       Enca stands for Extremely Naive Charset Analyser.  Nevertheless, the
746       `enc' originally comes from `encoding', so the leading `e' should be
747       read as in `encoding', not as in `extreme'.
748

AUTHORS

750       David Necas (Yeti) <yeti@physics.muni.cz>
751
752       Michal Cihar <michal@cihar.com>
753
754       Unicode data has been generated from various (free) on-line resources
755       or using GNU recode.  Statistical data has been generated from various
756       texts on the Net; I hope character counting doesn't break anyone's
757       copyright.
758

ACKNOWLEDGEMENTS

760       Please see the file THANKS in the distribution.
761

COPYRIGHT

763       Copyright (C) 2000-2003 David Necas (Yeti).
764
765       Copyright (C) 2009 Michal Cihar <michal@cihar.com>.
766
767       Enca  is  free software; you can redistribute it and/or modify it under
768       the terms of version 2 of the GNU General Public License  as  published
769       by the Free Software Foundation.
770
771       Enca is distributed in the hope that it will be useful, but WITHOUT ANY
772       WARRANTY; without even the implied warranty of MERCHANTABILITY or  FIT‐
773       NESS  FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
774       more details.
775
776       You should have received a copy of the GNU General Public License along
777       with  Enca;  if  not,  write to the Free Software Foundation, Inc., 675
778       Mass Ave, Cambridge, MA 02139, USA.
779
780
781
782
783enca 1.11                          Sep 2009                            enca(1)