catdoc(1)                   General Commands Manual                  catdoc(1)

NAME

       catdoc - reads an MS-Word file and writes its content as plain text to
       standard output

SYNOPSIS

       catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset]
              [-f output-format] file

DESCRIPTION

       catdoc behaves much like cat(1), but it reads an MS-Word file and
       produces human-readable text on standard output. Optionally, it can
       use latex(1) escape sequences for characters that have special meaning
       in LaTeX. It also makes some effort to recognize MS-Word tables,
       although it never tries to write correct headers for the LaTeX tabular
       environment. Additional output formats, such as HTML, can be easily
       defined.

       catdoc doesn't attempt to extract formatting information other than
       tables from an MS-Word document, so the output modes differ mainly in
       which characters are escaped and in how characters missing from the
       output charset are represented. See CHARACTER SUBSTITUTION below.

       catdoc uses an internal unicode(4) representation of text, so it is
       able to convert text when the charset of the source document doesn't
       match the charset of the target system. See CHARACTER SETS below.

       If no file names are supplied, catdoc processes its standard input
       unless it is a terminal. It is unlikely that somebody would type a Word
       document from the keyboard, so if catdoc is invoked without arguments
       and stdin is not redirected, it prints a brief usage message and exits.
       Processing of standard input (even among other files) can be forced by
       using a dash '-' as a file name.

       By default, catdoc wraps lines that are more than 72 characters long
       and separates paragraphs with blank lines. This behavior can be turned
       off with the -w switch. In wide mode, catdoc prints each paragraph as
       one long line, suitable for import into word processors that perform
       word wrapping themselves.
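
       The wrapping behavior described above can be approximated in a few
       lines. The following Python sketch is an illustration only and is not
       taken from catdoc: format_paragraphs and its margin parameter are
       hypothetical names, the input is assumed to hold paragraphs separated
       by line feeds, and the choice to omit blank lines in wide mode is an
       assumption.

```python
import textwrap


def format_paragraphs(text, margin=72):
    """Render LF-separated paragraphs; margin=0 mimics wide mode (-w)."""
    paragraphs = [p for p in text.split("\n") if p]
    if margin == 0:
        # Wide mode: each paragraph stays on one long line.
        # Assumption: no blank lines are added between paragraphs here.
        return "\n".join(paragraphs)
    # Default mode: wrap at the right margin, blank line between paragraphs.
    wrapped = [textwrap.fill(p, width=margin) for p in paragraphs]
    return "\n\n".join(wrapped)
```

       Treating a margin of 0 as wide mode mirrors the statement under
       OPTIONS that -m 0 is equivalent to -w.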

OPTIONS

       -a      Shortcut for -f ascii. Produces ASCII text as output and
               separates table columns with TAB.

       -b      Process a broken MS-Word file. Normally, catdoc checks whether
               the first 8 bytes of the file are the Microsoft OLE signature.
               If so, it processes the file; otherwise it just copies it to
               stdout. This is intended to allow using catdoc as a filter for
               viewing all files with a .doc extension.

       -d charset
               Specifies the destination charset name. The charset file has
               the format described in CHARACTER SETS below, should have a
               .txt extension, and should reside in the catdoc library
               directory (/usr/lib64/catdoc). By default, the current locale
               charset is used if langinfo support is compiled in.

       -f format
               Specifies the output format as described in CHARACTER
               SUBSTITUTION below. catdoc comes with two output formats,
               ascii and tex. You can add your own if you wish.

       -l      Lists the names of available charsets on stdout and exits
               successfully.

       -m number
               Specifies the right margin for text (default 72). -m 0 is
               equivalent to -w.

       -s charset
               Specifies the source charset (the one used in the Word
               document), if the Word document doesn't contain UTF-16 text.
               When reading RTF documents, it is typically not necessary,
               because RTF documents contain an ansicpg specification. But
               Word can set it wrongly (I've seen RTF documents in Russian
               where cp1252 was specified). In this case, this option takes
               precedence over the charset specified in the document. The
               source_charset statement in the configuration file, however,
               has lower priority than the charset in the document.

       -t      Shortcut for -f tex. Converts all printable characters that
               have special meaning for latex(1) into appropriate control
               sequences. Separates table columns with &.

       -u      Declares that the Word document contains a UNICODE (UTF-16)
               representation of text (as some Word-97 documents do). If
               catdoc fails to correctly process a Word document with the
               default charset, try this option.

       -8      Declares that the Word document is 8-bit, in case catdoc
               recognizes the file format incorrectly.

       -w      Disables word wrapping. By default, catdoc output is split
               into lines no longer than 72 characters (or the number
               specified by the -m option), and paragraphs are separated by a
               blank line. With this option, each paragraph is one long line.

       -x      Causes catdoc to output unknown UNICODE characters as \xNNNN
               instead of question marks.

       -v      Causes catdoc to print some useless information about the Word
               document structure to stdout before the actual start of text.

       -V      Outputs the catdoc version.
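
       The signature check mentioned under -b can be sketched as follows.
       This is a minimal Python illustration, not catdoc's implementation:
       the byte values are the well-known 8-byte OLE compound file signature
       (they are not given in this page), and looks_like_ole and filter_file
       are hypothetical names.

```python
# The 8-byte signature at the start of an OLE compound file.
OLE_SIGNATURE = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def looks_like_ole(data):
    """Return True if data starts with the 8-byte OLE signature."""
    return data[:8] == OLE_SIGNATURE


def filter_file(path):
    """Hypothetical dispatcher: 'convert' OLE files, 'copy' anything else."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    return "convert" if looks_like_ole(header) else "copy"
```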

CHARACTER SETS

       When processing an MS-Word file, catdoc uses information about two
       character sets, typically different: input and output. They are stored
       in plain text files in the catdoc library directory. Each line of a
       character set file should contain two whitespace-separated hexadecimal
       numbers: the 8-bit code in the character set and the 16-bit Unicode
       code. Anything from a hash mark to the end of the line is ignored, as
       are blank lines.

       The catdoc distribution includes some of these character sets.
       Additional character set definitions, directly usable by catdoc, can be
       obtained from ftp.unicode.org. Charset files have a .txt suffix, which
       shouldn't be specified on the command line or in configuration files.

       Note that catdoc is distributed with Cyrillic charsets as the default.
       If you are not Russian, you probably don't want that, and should
       reconfigure catdoc at compile time or in the runtime configuration
       file.

       When dealing with documents in charsets other than the default,
       remember that Microsoft never uses ISO charsets. While letters in, say,
       cp1252 are at the same positions as in ISO-8859-1, some punctuation
       signs would be lost if you specified ISO-8859-1 as the input charset.
       If you use cp1252, catdoc deals with those signs as described in
       CHARACTER SUBSTITUTION below.
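
       To make the charset file format concrete, here is a small Python
       sketch of a reader for it. It is an illustration under stated
       assumptions rather than catdoc's own parser: parse_charset is a
       hypothetical name, and the sample lines are made up to show the
       two-column layout with comments and blank lines.

```python
def parse_charset(lines):
    """Map 8-bit codes to Unicode code points from charset-file lines."""
    table = {}
    for line in lines:
        # Anything from a hash mark to the end of the line is ignored.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        # Two whitespace-separated hexadecimal numbers per line.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Made-up sample in the two-column format (not copied from a real file).
sample = [
    "# 8-bit code, then Unicode code",
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    "",
    "0x80\t0x20AC",
]
table = parse_charset(sample)
```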

CHARACTER SUBSTITUTION

       catdoc converts an MS-Word file into the following internal Unicode
       representation:

       1. Paragraphs are separated by the ASCII Line Feed symbol (0x000A).

       2. Table cells within a row are separated by the ASCII Field Separator
           symbol (0x001C).

       3. Table rows are separated by the ASCII Record Separator (0x001E).

       4. All printable characters, including whitespace, are represented by
           their respective UNICODE codes.

       This UNICODE representation is subsequently converted into 8-bit text
       in the target character set using the following four-step algorithm:

       1. The list of special characters is searched for the given Unicode
           character. If found, the appropriate multi-character sequence is
           output instead of the character.

       2. If there is an equivalent in the target character set, it is output.

       3. Otherwise, the replacement list is searched and, if there is a
           multi-character substitution for this UNICODE character, it is
           output.

       4. If all of the above fail, the "Unknown char" symbol (question mark)
           is output.

       The lists of special characters and of substitutions are independent of
       the character set: special characters should be escaped regardless of
       their existence in the target character set (usually they are part of
       US-ASCII, and therefore exist in any character set), and the
       replacement list is searched only for those characters that are not
       found in the target character set.

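       The four-step conversion above can be sketched in Python as follows.
       This is an illustrative reading of the steps, not catdoc's source:
       convert_char, convert_text, and the three toy tables are hypothetical,
       and substitution strings are assumed to be plain ASCII.

```python
def convert_char(ch, special, target, replacement, unknown="?"):
    """Convert one Unicode character following the four steps above.

    special and replacement map characters to strings; target maps
    characters to 8-bit codes. All three tables are hypothetical inputs.
    """
    if ch in special:                  # step 1: escape special characters
        return special[ch].encode("ascii")
    if ch in target:                   # step 2: direct equivalent
        return bytes([target[ch]])
    if ch in replacement:              # step 3: multi-character substitute
        return replacement[ch].encode("ascii")
    return unknown.encode("ascii")     # step 4: unknown character


def convert_text(text, special, target, replacement):
    """Convert a whole string one character at a time."""
    return b"".join(
        convert_char(c, special, target, replacement) for c in text
    )


# Toy tables: a tex-like escape, a US-ASCII target, one replacement.
special = {"%": "\\%"}
target = {chr(i): i for i in range(0x80)}
replacement = {"\u2026": "..."}
result = convert_text("50% \u2026 \u4e2d", special, target, replacement)
```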
       These lists are stored in the catdoc library directory, in files whose
       names are prefixed with the format name. These files have the following
       format:

       Each line is either a comment (starting with a hash mark) or contains a
       hexadecimal UNICODE value, separated by whitespace from the string that
       is substituted for it. If the string contains no whitespace, it can be
       used as is; otherwise it should be enclosed in single or double quotes.
       The usual backslash sequences, such as '\n' and '\t', can be used in
       these strings.
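
       A reader for one line of such a map file might look like the Python
       sketch below. It is a guess at reasonable behavior, not catdoc's
       parser: parse_map_line and unescape are hypothetical names, only a
       small assumed subset of backslash sequences is expanded, unknown
       sequences are kept verbatim, and a quoted string is assumed to run up
       to the last matching quote on the line.

```python
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "'": "'", '"': '"'}


def unescape(text):
    """Expand a few common backslash sequences (an assumed subset)."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            # Assumption: unknown sequences such as \% are kept verbatim.
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_map_line(line):
    """Return (code point, substitution) or None for comments and blanks."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = stripped.split(None, 1)
    code = int(fields[0], 16)
    rest = fields[1] if len(fields) > 1 else ""
    if rest[:1] in ("'", '"'):
        # Assumption: the string runs up to the last matching quote.
        end = rest.rfind(rest[0])
        value = rest[1:end] if end > 0 else rest[1:]
    else:
        value = rest.split()[0] if rest else ""
    return code, unescape(value)
```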

RUNTIME CONFIGURATION

       Upon startup, catdoc reads its system-wide configuration file (catdocrc
       in the catdoc library directory) and then the user-specific
       configuration file ${HOME}/.catdocrc.

       These files can contain the following directives:

       source_charset = charset-name
               Sets the default source charset, which is used if no -s option
               is specified. Consult the configuration of a nearby Windows
               workstation to find the one you need.

       target_charset = charset-name
               Sets the default output charset. You probably know which one
               you use.

       charset_path = directory-list
               Colon-separated list of directories that are searched for
               charset files. This allows you to install additional charsets
               in your home directory. If the first directory component of a
               path is ~, it is replaced by the contents of the HOME
               environment variable. On the MS-DOS platform, if a directory
               name starts with %s, it is replaced with the directory of the
               executable file. An empty element in the list (i.e. two
               consecutive colons) is considered the current directory.

       map_path = directory-list
               Colon-separated list of directories that are searched for the
               special character map and the replacement map. The same
               substitution rules as in charset_path apply.

       format = format name
               Output format that is used by default. catdoc comes with two
               formats, ascii and tex, but nothing prevents you from writing
               your own format (a set of two map files: a special character
               map and a replacement map).

       unknown_char = character specification
               Sets the character to output instead of an unknown Unicode
               character (default '?'). The character specification can have
               one of two forms: a character enclosed in single quotes, or a
               hexadecimal code.

       use_locale = (yes|no)
               Enables or disables automatic selection of the output charset
               (default yes), based on system locale settings (if enabled at
               compile time). If automatic detection is enabled, then output
               charset settings in the configuration files (but not on the
               command line) are ignored, and the current system locale
               charset is used instead. There is no automatic choice of input
               charset based on the locale language, because most modern Word
               files (since Word 97) are Unicode anyway.
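
       The substitution rules for charset_path and map_path can be
       illustrated with a short Python sketch. It is not catdoc's code:
       expand_path_list and its home parameter are hypothetical names, and
       the MS-DOS %s rule is deliberately left out.

```python
import os


def expand_path_list(value, home=None):
    """Split a colon-separated directory list using the rules above."""
    if home is None:
        home = os.environ.get("HOME", "")
    directories = []
    for element in value.split(":"):
        if element == "":
            # An empty element means the current directory.
            directories.append(".")
        elif element == "~" or element.startswith("~/"):
            # A leading ~ component is replaced by the contents of HOME.
            directories.append(home + element[1:])
        else:
            directories.append(element)
    return directories
```

       For example, with home set to /home/u, the value
       ~/charsets::/usr/lib64/catdoc expands to /home/u/charsets, the current
       directory, and /usr/lib64/catdoc, in that order.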

BUGS

       Doesn't handle fast-saves properly. Prints footnotes as separate
       paragraphs at the end of the file, instead of producing correct LaTeX
       commands. Cannot distinguish between an empty table cell and the end of
       a table row.

SEE ALSO

       xls2csv(1), cat(1), strings(1), utf(4), unicode(4)

AUTHOR

       V.B.Wagner <vitus@45.free.net>

MS-Word reader                  Version 0.94.2                       catdoc(1)