preconv(1)

preconv(1)                  General Commands Manual                 preconv(1)

Name

       preconv - prepare files for typesetting with groff

Synopsis

       preconv [-dr] [-D fallback-encoding] [-e encoding] [file ...]

       preconv -h
       preconv --help

       preconv -v
       preconv --version

Description

       preconv  reads  each  file,  converts  its encoded characters to a form
       troff(1) can interpret, and sends the result  to  the  standard  output
       stream.   Currently, this means that code points in the range 0–127 (in
       US-ASCII, ISO 8859, or Unicode) remain as-is and the remainder are con‐
       verted  to the groff special character form “\[uXXXX]”, where XXXX is a
       hexadecimal number of four to six digits  corresponding  to  a  Unicode
       code point.  By default, preconv also inserts a roff .lf request at the
       beginning of each file, identifying it for the benefit  of  later  pro‐
       cessing  (including diagnostic messages); the -r option suppresses this
       behavior.

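       The conversion described above can be sketched in a few lines of
       Python.  This illustrates only the output form and is not preconv's
       implementation; the function name to_groff is hypothetical, and the
       sketch assumes the input has already been decoded to Unicode text.

```python
def to_groff(text: str) -> str:
    """Rewrite non-ASCII characters as groff \\[uXXXX] special characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Code points 0-127 pass through unchanged.
            out.append(ch)
        else:
            # At least four uppercase hex digits; more for larger code points.
            out.append("\\[u%04X]" % cp)
    return "".join(out)


print(to_groff("café"))  # prints caf\[u00E9]
```

       Code points above U+FFFF simply produce five or six digits, which
       matches the "four to six digits" form described above.
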
       In typical usage scenarios, preconv need not be run  directly;  instead
       it  should  be  invoked with the -k or -K options of groff.  If no file
       operands are given on the command line, or if file is “-”, the standard
       input stream is read.

       preconv  tries to find the input encoding with the following algorithm,
       stopping at the first success.

       1.  If the input encoding has been explicitly specified with option -e,
           use it.

       2.  If  the  input starts with a Unicode Byte Order Mark, determine the
           encoding as UTF-8, UTF-16, or UTF-32 accordingly.

       3.  If the input stream is seekable, check the first and  second  input
           lines  for  a  recognized GNU Emacs file-local variable identifying
           the character encoding, here referred to as the  “coding  tag”  for
           brevity.  If found, use it.

       4.  If  the  input  stream  is seekable, and if the uchardet library is
           available on the system, use it to try to infer the encoding of the
           file.

       5.  If the -D option specifies an encoding, use it.

       6.  Use the encoding specified by the current locale (LC_CTYPE), unless
           the locale is “C”, “POSIX”, or empty, in which case assume  Latin-1
           (ISO 8859-1).

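       The order of the steps above can be sketched as follows.  This is a
       simplified model in Python, not preconv's code: the names
       encoding_from_bom and choose_encoding and all of their parameters are
       hypothetical, steps 3 and 4 are left out, and the locale's codeset is
       passed in rather than queried from the system.

```python
import codecs
from typing import Optional

# Check the longer marks first: the UTF-32LE mark begins with the
# same two bytes as the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def encoding_from_bom(data: bytes) -> Optional[str]:
    """Step 2: name the encoding family announced by a byte order mark."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(data: bytes,
                    explicit: Optional[str] = None,   # models -e
                    fallback: Optional[str] = None,   # models -D
                    locale_name: str = "C",
                    locale_codeset: str = "") -> str:
    if explicit:                                      # step 1
        return explicit
    from_bom = encoding_from_bom(data)                # step 2
    if from_bom:
        return from_bom
    # Steps 3 and 4 (coding tag, uchardet) are omitted from this sketch.
    if fallback:                                      # step 5
        return fallback
    if locale_name in ("", "C", "POSIX"):             # step 6
        return "latin-1"
    return locale_codeset
```

       For instance, choose_encoding(b"abc", fallback="cp1047") returns
       "cp1047", while the same call without a fallback returns "latin-1"
       under the default "C" locale.
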
       The  coding tag and uchardet methods in the above procedure rely upon a
       seekable input stream; when preconv reads from a pipe,  the  stream  is
       not  seekable,  and  these detection methods are skipped.  If character
       encoding detection of your input files is unreliable, arrange  for  one
       of the other methods to succeed by using preconv's -D or -e options, or
       by configuring  your  locale  appropriately.   groff  also  supports  a
       GROFF_ENCODING  environment variable, which can be overridden by its -K
       option.  Valid values for (or parameters to) all of these  are  enumer‐
       ated in the lists of recognized coding tags in the next subsection, and
       are further influenced by iconv library support.

   Coding tags
       Text editors that support more than a single  character  encoding  need
       tags  within  the input files to mark the file's encoding.  While it is
       possible to guess the right input encoding with the help of  heuristics
       that  are  reliable for a preponderance of natural language texts, they
       are not absolutely reliable.  Heuristics can fail on  inputs  that  are
       too short or don't represent a natural language.

       Consequently,  preconv  supports  the  coding  tag  convention  used by
       GNU Emacs (with some restrictions).  This notation appears in specially
       marked regions of an input file designated for “file-local variables”.

       preconv  interprets the following syntax if it occurs in a roff comment
       in the first or second line of the input file.  Both “\"” and “\#” com‐
       ment  forms are recognized, but the control (or no-break control) char‐
       acter must be the default and must begin the line.  Similarly, the  es‐
       cape character must be the default.
              -*- [...;] coding: encoding[; ...] -*-

       The  only  variable  preconv interprets is “coding”, which can take the
       values listed below.

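       A rough Python model of this lookup may make the syntax concrete.  It
       is an approximation, not preconv's parser: the name find_coding_tag
       and the regular expression are assumptions, and details such as
       whitespace between the control character and the comment escape are
       not handled.

```python
import re
from typing import Optional

# A default control character (. or ') at the start of the line, then a
# \" or \# comment, then a region delimited by -*- ... -*-.
TAG_RE = re.compile(r"""^[.']\\["#].*?-\*-(.*?)-\*-""")


def find_coding_tag(text: str) -> Optional[str]:
    """Return the value of "coding" from the first two lines, if any."""
    for line in text.splitlines()[:2]:
        match = TAG_RE.match(line)
        if not match:
            continue
        # Variables are separated by semicolons: "name: value; name: value".
        for variable in match.group(1).split(";"):
            name, sep, value = variable.partition(":")
            if sep and name.strip().lower() == "coding":
                return value.strip().lower()
    return None


print(find_coding_tag('.\\" -*- mode: nroff; coding: utf-8 -*-\n.TH x 1'))
```

       The sketch returns None when the tag sits on the third line or later,
       or when the line does not begin with a comment.
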
       The following list comprises all MIME “charset” parameter values recog‐
       nized, case-insensitively, by preconv.
              big5,  cp1047,  euc-jp,  euc-kr, gb2312, iso-8859-1, iso-8859-2,
              iso-8859-5, iso-8859-7,  iso-8859-9,  iso-8859-13,  iso-8859-15,
              koi8-r, us-ascii, utf-8, utf-16, utf-16be, utf-16le

       In  addition,  the  following  list of other coding tags is recognized,
       each of which is mapped to an appropriate value from the list above.
              ascii,  chinese-big5,  chinese-euc,  chinese-iso-8bit,  cn-big5,
              cn-gb,      cn-gb-2312,     cp878,     csascii,     csisolatin1,
              cyrillic-iso-8bit, cyrillic-koi8, euc-china, euc-cn,  euc-japan,
              euc-japan-1990,   euc-korea,   greek-iso-8bit,   iso-10646/utf8,
              iso-10646/utf-8,    iso-latin-1,    iso-latin-2,    iso-latin-5,
              iso-latin-7, iso-latin-9, japanese-euc, japanese-iso-8bit, jis8,
              koi8, korean-euc,  korean-iso-8bit,  latin-0,  latin1,  latin-1,
              latin-2,  latin-5,  latin-7,  latin-9,  mule-utf-8, mule-utf-16,
              mule-utf-16be,   mule-utf-16-be,   mule-utf-16be-with-signature,
              mule-utf-16le,   mule-utf-16-le,   mule-utf-16le-with-signature,
              utf8,            utf-16-be,            utf-16-be-with-signature,
              utf-16be-with-signature,   utf-16-le,  utf-16-le-with-signature,
              utf-16le-with-signature

       Trailing “-dos”, “-unix”, and “-mac” suffixes on coding tags (which in‐
       dicate the end-of-line convention used in the file) are disregarded for
       the purpose of comparison with the above tags.

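       The comparison rules above (case-insensitive matching, suffix removal,
       and alias mapping) might be modeled as below.  The name normalize_tag
       is hypothetical, and the alias table is a small illustrative subset
       whose target values are assumptions rather than preconv's table.

```python
# Illustrative subset only; the mapped values are assumptions.
ALIASES = {
    "latin-1": "iso-8859-1",
    "latin1": "iso-8859-1",
    "utf8": "utf-8",
    "cyrillic-koi8": "koi8-r",
}

# End-of-line convention suffixes that play no part in the comparison.
EOL_SUFFIXES = ("-dos", "-unix", "-mac")


def normalize_tag(tag: str) -> str:
    """Lowercase a coding tag, drop an EOL suffix, and resolve aliases."""
    tag = tag.lower()
    for suffix in EOL_SUFFIXES:
        if tag.endswith(suffix):
            tag = tag[: -len(suffix)]
            break
    return ALIASES.get(tag, tag)


print(normalize_tag("Latin-1-dos"))  # prints iso-8859-1
```

       Tags that are not in the alias table pass through unchanged once
       lowercased and stripped of their suffix.
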
   iconv support
       While preconv recognizes all of the coding tags listed above, it is ca‐
       pable  on  its  own of interpreting only three encodings: Latin-1, code
       page 1047, and UTF-8.  If iconv support is configured at  compile  time
       and available at run time, all others are passed to iconv library func‐
       tions, which may recognize many additional encoding strings.  The  com‐
       mand “preconv -v” discloses whether iconv support is configured.

       The use of iconv means that characters in the input that encode invalid
       code points for that encoding may be dropped from the output stream  or
       mapped to the Unicode replacement character (U+FFFD).  Compare the fol‐
       lowing examples using the input “café” (note the “e” with an acute  ac‐
       cent), which due to its short length challenges inference of the encod‐
       ing used.
              printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
              printf 'caf\351\n' | preconv -e us-ascii
              printf 'caf\351\n' | preconv -e latin-1
       The fate of the accented “e” differs  in  each  case.   In  the  first,
       uchardet fails to detect an encoding (though the library on your system
       may behave differently) and preconv falls back to the locale  settings,
       where  octal 351 starts an incomplete UTF-8 sequence and results in the
       Unicode replacement character.  In the  second,  it  is  not  a  repre‐
       sentable  character  in  the declared input encoding of US-ASCII and is
       discarded by iconv.  In the last, the declared encoding matches the in‐
       put, so the character is correctly interpreted and mapped.

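       The three outcomes can be reproduced in miniature with Python's own
       codecs, which stand in for iconv here.  This is only an analogy: the
       error handlers "replace" and "ignore" are chosen to mimic the behavior
       described above, and an actual iconv library may act differently.

```python
data = b"caf\xe9\n"  # the bytes written by: printf 'caf\351\n'

# Decoded as UTF-8: 0xE9 starts a multibyte sequence that never completes.
as_utf8 = data.decode("utf-8", errors="replace")

# Decoded as US-ASCII: 0xE9 is not representable and is dropped.
as_ascii = data.decode("ascii", errors="ignore")

# Decoded as Latin-1: 0xE9 is "e" with an acute accent.
as_latin1 = data.decode("latin-1")

print(repr(as_utf8))
print(repr(as_ascii))
print(repr(as_latin1))
```

       The first result contains U+FFFD in place of the accented letter, the
       second has lost it entirely, and only the third preserves it.
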
   Limitations
       preconv cannot perform any transformation on input that it cannot  see.
       Examples  include files that are interpolated by preprocessors that run
       subsequently, including  soelim(1);  files  included  by  troff  itself
       through  “so”  and  similar  requests; and string definitions passed to
       troff through its -d command-line option.

       preconv assumes that its input uses the  default  escape  character,  a
       backslash \, and writes special character escape sequences accordingly.

Options

       -h and --help display a usage message, while -v and --version show ver‐
       sion information; all exit afterward.

       -d     Emit debugging messages to the standard error stream.

       -D fallback-encoding
              Assume fallback-encoding if all detection methods fail.

       -e encoding
              Skip detection and assume encoding; see groff's -K option.

       -r     Write files “raw”; do not add .lf requests.
