PRECONV(1) General Commands Manual PRECONV(1)

NAME
preconv - convert encoding of input files to something GNU troff
understands

SYNOPSIS
preconv [-dr] [-e encoding] [files ...]
preconv -h | --help
preconv -v | --version

It is possible to have whitespace between the -e command line option
and its parameter.

DESCRIPTION
preconv reads files and converts their encoding(s) to a form GNU
troff(1) can process, sending the data to standard output. Currently,
this means ASCII characters and `\[uXXXX]' entities, where `XXXX' is a
hexadecimal number with four to six digits, representing a Unicode
input code. Normally, preconv should be invoked with the -k and -K
options of groff.
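
The output form described above can be illustrated with a minimal
Python sketch. It is not preconv's implementation; the function name
to_groff_entities is hypothetical, and the use of uppercase hexadecimal
digits is an assumption. It leaves ASCII untouched and writes every
other code point as a `\[uXXXX]' entity with at least four digits.

```python
def to_groff_entities(text: str) -> str:
    """Keep ASCII as-is; write other code points as \\[uXXXX] entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        else:
            # Pad to four hex digits; code points above U+FFFF
            # naturally need five or six digits.
            out.append("\\[u%04X]" % cp)
    return "".join(out)


print(to_groff_entities("Grüße"))
```

Special handling that a real preprocessor may need (for example for
characters that are significant to troff) is deliberately left out.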

OPTIONS
-d     Emit debugging messages to standard error (mainly the encoding
       used).

-Dencoding
       Specify the default encoding to use if everything else fails
       (see below).

-eencoding
       Specify input encoding explicitly, overriding all other methods.
       This corresponds to groff's -Kencoding option. Without this
       switch, preconv uses the algorithm described below to select the
       input encoding.

--help
-h     Print help message.

-r     Do not add .lf requests.

--version
-v     Print version number.

USAGE
preconv tries to find the input encoding with the following algorithm.

1. If the input encoding has been explicitly specified with option
   -e, use it.

2. Otherwise, check whether the input starts with a Byte Order Mark
   (BOM, see below). If found, use it.

3. Otherwise, check whether there is a known coding tag (see below)
   in either the first or second input line. If found, use it.

4. If everything fails, use a default encoding as given with option
   -D, by the current locale, or `latin1' if the locale is set to
   `C', `POSIX', or empty (in that order).
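
The priority order of these four steps can be sketched in Python as
follows. This is an illustration only, not preconv's code: the function
select_encoding and its parameters are hypothetical, the BOM and coding
tag results are assumed to have been determined elsewhere, and taking
the codeset from the part of the locale name after the dot is an
assumption about how "the current locale" yields an encoding.

```python
def select_encoding(explicit=None, bom=None, tag=None,
                    default=None, locale_name=""):
    """Return the first encoding found, following steps 1 to 4."""
    # Steps 1-3 and the -D default: the first candidate that is set wins.
    for candidate in (explicit, bom, tag, default):
        if candidate:
            return candidate
    # Step 4, continued: fall back to the locale, then to latin1.
    if locale_name in ("", "C", "POSIX"):
        return "latin1"
    # Assumption: names look like "de_DE.ISO-8859-15@euro".
    _, dot, codeset = locale_name.partition(".")
    return codeset.split("@")[0] if dot else "latin1"


print(select_encoding(tag="latin-2", default="utf-8"))
print(select_encoding(locale_name="POSIX"))
```

Passing the detection results in as arguments keeps the sketch focused
on the ordering alone.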

Note that the groff program supports a GROFF_ENCODING environment
variable which is eventually expanded to option -k.

Byte Order Mark
The Unicode Standard defines character U+FEFF as the Byte Order Mark
(BOM). On the other hand, value U+FFFE is guaranteed not to be a
Unicode character at all. This makes it possible to detect the byte
order within the data stream (either big-endian or little-endian), and
the MIME encodings `UTF-16' and `UTF-32' mandate that the data stream
starts with U+FEFF. Similarly, a data stream encoded as `UTF-8' might
start with a BOM (to ease the conversion from and to UTF-16 and
UTF-32). In all cases, the byte order mark is not part of the data but
part of the encoding protocol; in other words, preconv's output
doesn't contain it.

Note that a U+FEFF that is not at the start of the input data is
emitted; it then has the meaning of a `zero width no-break space'
character, something not normally needed in groff.
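
A minimal Python sketch of BOM handling, under the assumption that the
encoded forms of U+FEFF are compared against the first bytes of the
input. The function strip_bom is hypothetical and does not claim to
mirror preconv's source. It returns the detected codec and the data
without the leading mark; a later U+FEFF stays in the data.

```python
import codecs

# Longer marks first: the UTF-32LE mark begins with the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def strip_bom(data: bytes):
    """Return (codec, payload); codec is None if no BOM was found."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec, data[len(bom):]
    return None, data


# UTF-16LE input: BOM, 'A', a second U+FEFF, 'B'.
codec, payload = strip_bom(b"\xff\xfeA\x00\xff\xfeB\x00")
text = payload.decode(codec)
print(codec, ascii(text))  # the second U+FEFF survives as data
```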

Coding Tags
Editors which support more than a single character encoding need tags
within the input files to mark the file's encoding. While it is
possible to guess the right input encoding with the help of heuristic
algorithms for data which represents a larger amount of natural
language text, it is still just a guess. Additionally, all algorithms
fail easily for input which is either too short or doesn't represent a
natural language.

For these reasons, preconv supports the coding tag convention (with
some restrictions) as used by GNU Emacs and XEmacs (and probably other
programs too).

Coding tags in GNU Emacs and XEmacs are stored in so-called File
Variables. preconv recognizes the following syntax form, which must be
put into a troff comment in the first or second line.

    -*- tag1: value1; tag2: value2; ... -*-

The only relevant tag for preconv is `coding', which can take the
values listed below. Here is an example line which tells Emacs to edit
a file in troff mode, and to use latin-2 as its encoding.

    .\" -*- mode: troff; coding: latin-2 -*-
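
Extracting the `coding' value from such a line might look like the
following Python sketch. The function coding_tag is hypothetical; it
only looks at the first two lines, as described above, and for brevity
it does not verify that the tag sits inside a troff comment or that the
value is a known encoding.

```python
import re

# Everything between the opening and closing -*- markers.
FILE_VARS = re.compile(r"-\*-(.*?)-\*-")


def coding_tag(text: str):
    """Return the value of the `coding' tag, or None if there is none."""
    for line in text.splitlines()[:2]:
        match = FILE_VARS.search(line)
        if not match:
            continue
        # The body is a list of "tag: value" pairs separated by ";".
        for pair in match.group(1).split(";"):
            name, sep, value = pair.partition(":")
            if sep and name.strip().lower() == "coding":
                return value.strip()
    return None


print(coding_tag('.\\" -*- mode: troff; coding: latin-2 -*-\n.TH X 1\n'))
```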

The following list gives all MIME coding tags (either lowercase or
uppercase) supported by preconv; this list is hard-coded in the source.

    big5, cp1047, euc-jp, euc-kr, gb2312, iso-8859-1, iso-8859-2,
    iso-8859-5, iso-8859-7, iso-8859-9, iso-8859-13, iso-8859-15,
    koi8-r, us-ascii, utf-8, utf-16, utf-16be, utf-16le

In addition, the following hard-coded list of other tags is
recognized; each eventually maps to a value from the list above.

    ascii, chinese-big5, chinese-euc, chinese-iso-8bit, cn-big5,
    cn-gb, cn-gb-2312, cp878, csascii, csisolatin1,
    cyrillic-iso-8bit, cyrillic-koi8, euc-china, euc-cn, euc-japan,
    euc-japan-1990, euc-korea, greek-iso-8bit, iso-10646/utf8,
    iso-10646/utf-8, iso-latin-1, iso-latin-2, iso-latin-5,
    iso-latin-7, iso-latin-9, japanese-euc, japanese-iso-8bit, jis8,
    koi8, korean-euc, korean-iso-8bit, latin-0, latin1, latin-1,
    latin-2, latin-5, latin-7, latin-9, mule-utf-8, mule-utf-16,
    mule-utf-16be, mule-utf-16-be, mule-utf-16be-with-signature,
    mule-utf-16le, mule-utf-16-le, mule-utf-16le-with-signature,
    utf8, utf-16-be, utf-16-be-with-signature,
    utf-16be-with-signature, utf-16-le, utf-16-le-with-signature,
    utf-16le-with-signature

Those tags are taken from GNU Emacs and XEmacs, together with some
aliases. Trailing `-dos', `-unix', and `-mac' suffixes of coding tags
(which give the end-of-line convention used in the file) are stripped
off before the comparison with the above tags happens.
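
The suffix stripping and alias lookup can be sketched in Python as
shown here. The function normalize_tag is hypothetical, and the alias
table is a small illustrative subset chosen for this example, not the
full hard-coded list from the preconv source.

```python
# Illustrative subset only; the real table is much longer.
ALIASES = {
    "latin1": "iso-8859-1",
    "latin-1": "iso-8859-1",
    "latin-2": "iso-8859-2",
    "utf8": "utf-8",
}
EOL_SUFFIXES = ("-dos", "-unix", "-mac")


def normalize_tag(tag: str) -> str:
    """Lowercase a tag, drop an end-of-line suffix, resolve aliases."""
    tag = tag.lower()
    for suffix in EOL_SUFFIXES:
        if tag.endswith(suffix):
            tag = tag[: -len(suffix)]
            break
    # Tags without an alias entry are returned unchanged.
    return ALIASES.get(tag, tag)


print(normalize_tag("Latin-2-unix"))
```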

Iconv Issues
preconv by itself only supports three encodings: latin-1, cp1047, and
UTF-8; all other encodings are passed to the iconv library functions.
At compile time, the build searches for a valid iconv implementation
and checks it; a call to `preconv --version' shows whether iconv is
used.

BUGS
preconv doesn't support local variable lists yet. This is a different
syntax form to specify local variables at the end of a file.

SEE ALSO
groff(1)
the GNU Emacs and XEmacs info pages

Groff Version 1.22.2 7 February 2013 PRECONV(1)