catdoc(1)                   General Commands Manual                  catdoc(1)

NAME

       catdoc - reads an MS-Word file and writes its content as plain text to
       standard output

SYNOPSIS

       catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset]
              [-f output-format] file

DESCRIPTION

       catdoc behaves much like cat(1), but it reads an MS-Word file and
       produces human-readable text on standard output. Optionally, it can use
       latex(1) escape sequences for characters that have special meaning in
       LaTeX. It also makes some effort to recognize MS-Word tables, although
       it never tries to write correct headers for the LaTeX tabular
       environment. Additional output formats, such as HTML, can be easily
       defined.

       catdoc doesn't attempt to extract formatting information other than
       tables from an MS-Word document, so the output modes differ mainly in
       which characters are escaped and in how characters missing from the
       output charset are represented. See CHARACTER SUBSTITUTION below.

       catdoc uses an internal unicode(4) representation of text, so it is
       able to convert text when the charset of the source document doesn't
       match the charset of the target system. See CHARACTER SETS below.

       If no file names are supplied, catdoc processes its standard input
       unless it is a terminal. It is unlikely that somebody could type a Word
       document from the keyboard, so if catdoc is invoked without arguments
       and stdin is not redirected, it prints a brief usage message and exits.
       Processing of standard input (even among other files) can be forced by
       using a dash '-' as a file name.

       By default, catdoc wraps lines that are more than 72 chars long and
       separates paragraphs with blank lines. This behavior can be turned off
       with the -w switch. In wide mode, catdoc prints each paragraph as one
       long line, suitable for import into word processors that perform word
       wrapping.
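
       The default and wide-mode layouts described above can be approximated
       with a short Python sketch. It relies on Python's textwrap module
       rather than on catdoc's own wrapping code. The render function and its
       margin parameter are hypothetical names, and the absence of blank lines
       between paragraphs in wide mode is an assumption.

```python
import textwrap


def render(paragraphs, margin=72):
    """Lay out paragraphs roughly as the DESCRIPTION section describes.

    margin > 0:  wrap each paragraph to at most `margin` columns and
                 separate paragraphs with a blank line.
    margin == 0: wide mode, one long line per paragraph (assumption:
                 no blank line between paragraphs in this mode).
    """
    # Collapse internal whitespace so each paragraph is a single logical line.
    flat = [" ".join(p.split()) for p in paragraphs]
    if margin == 0:
        return "\n".join(flat) + "\n"
    wrapped = [textwrap.fill(p, width=margin) for p in flat]
    return "\n\n".join(wrapped) + "\n"
```

       Calling render(paragraphs) mimics the default 72-column output, while
       render(paragraphs, margin=0) mimics the effect of -w or -m 0.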

OPTIONS

       -a      Shortcut for -f ascii. Produces ASCII text as output. Separates
               table columns with TAB.

       -b      Process broken MS-Word file. Normally, catdoc checks whether
               the first 8 bytes of the file are the Microsoft OLE signature.
               If so, it processes the file; otherwise it just copies it to
               stdout. This check is intended to let catdoc be used as a
               filter for viewing all files with the .doc extension.

       -dcharset
               Specifies the destination charset name. The charset file has
               the format described in CHARACTER SETS below, should have the
               .txt extension, and should reside in the catdoc library
               directory (/usr/lib64/catdoc). By default, the current locale
               charset is used if langinfo support is compiled in.

       -fformat
               Specifies the output format as described in CHARACTER
               SUBSTITUTION below. catdoc comes with two output formats, ascii
               and tex. You can add your own if you wish.

       -l      Causes catdoc to list the names of available charsets to stdout
               and exit successfully.

       -mnumber
               Specifies the right margin for text (default 72). -m 0 is
               equivalent to -w.

       -scharset
               Specifies the source charset (the one used in the Word
               document), which applies if the Word document doesn't contain
               UTF-16 text. When reading RTF documents, it is typically not
               necessary, because RTF documents contain an ansicpg
               specification. But Word can set it wrong (I've seen RTF
               documents in Russian where cp1252 was specified). In this case,
               this option takes precedence over the charset specified in the
               document. The source_charset statement in the configuration
               file, however, has lower priority than the charset in the
               document.

       -t      Shortcut for -f tex. Converts all printable chars that have
               special meaning for LaTeX(1) into appropriate control
               sequences. Separates table columns with &.

       -u      Declares that the Word document contains a UNICODE (UTF-16)
               representation of text (as some Word-97 documents do). If
               catdoc fails to convert a Word document correctly with the
               default charset, try this option.

       -8      Declares that the Word document is 8-bit, in case catdoc
               recognizes the file format incorrectly.

       -w      Disables word wrapping. By default, catdoc output is split into
               lines no longer than 72 characters (or the number specified by
               the -m option), and paragraphs are separated by a blank line.
               With this option, each paragraph is one long line.

       -x      Causes catdoc to output unknown UNICODE characters as \xNNNN
               instead of question marks.

       -v      Causes catdoc to print some useless information about the Word
               document structure to stdout before the actual start of text.

       -V      Outputs the catdoc version.
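
       The signature test mentioned under -b can be sketched in Python as
       follows. The 8-byte value is the well-known magic number of OLE
       compound files. The names looks_like_ole, filter_doc and convert are
       hypothetical and only illustrate the filter behavior; they are not
       part of catdoc.

```python
import io

# Magic number found in the first 8 bytes of OLE compound files.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(stream):
    """Return True if the binary stream starts with the OLE signature."""
    return stream.read(8) == OLE_SIGNATURE


def filter_doc(data, convert):
    """Run `convert` on OLE input; pass any other input through unchanged.

    `convert` stands in for the real Word-to-text conversion, which this
    sketch does not implement.
    """
    if looks_like_ole(io.BytesIO(data)):
        return convert(data)
    return data
```

       With this structure, a plain-text file that happens to carry a .doc
       extension is copied through untouched, which is what makes the filter
       usage practical.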

CHARACTER SETS

       When processing an MS-Word file, catdoc uses information about two
       character sets, typically different: input and output. They are stored
       in plain text files in the catdoc library directory. Each line of a
       character set file should contain two whitespace-separated hexadecimal
       numbers: the 8-bit code in the character set and the 16-bit Unicode
       code. Anything from a hash mark to the end of the line is ignored, as
       are blank lines.

       The catdoc distribution includes some of these character sets.
       Additional character set definitions, directly usable by catdoc, can
       be obtained from ftp.unicode.org. Charset files have the .txt suffix,
       which shouldn't be specified on the command line or in configuration
       files.

       Note that catdoc is distributed with Cyrillic charsets as the default.
       If you are not Russian, you probably don't want that, and should
       reconfigure catdoc at compile time or in the runtime configuration
       file.

       When dealing with documents in charsets other than the default,
       remember that Microsoft never uses ISO charsets. While letters in, say,
       cp1252 are at the same positions as in ISO-8859-1, some punctuation
       signs would be lost if you specify ISO-8859-1 as the input charset. If
       you use cp1252, catdoc deals with those signs as described in CHARACTER
       SUBSTITUTION below.
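
       The charset file format described in this section is simple enough to
       parse in a few lines. The following Python sketch is illustrative only:
       parse_charset and SAMPLE are hypothetical names, the sample lines are a
       small cp1252 excerpt, and skipping lines with fewer than two numbers is
       an assumption rather than documented catdoc behavior.

```python
def parse_charset(text):
    """Parse charset-file text into {8-bit code: Unicode code point}."""
    table = {}
    for raw in text.splitlines():
        # Anything from a hash mark to the end of the line is ignored.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: silently skip incomplete lines
        # int(..., 16) accepts hexadecimal with or without a 0x prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


SAMPLE = """\
# cp1252 excerpt (illustrative)
0x41  0x0041  # LATIN CAPITAL LETTER A
0x80  0x20AC  # EURO SIGN

0x93  0x201C  # LEFT DOUBLE QUOTATION MARK
"""

CP1252_EXCERPT = parse_charset(SAMPLE)
```

       The entries for 0x80 and 0x93 are examples of the punctuation signs
       that have no printable counterpart at the same position in ISO-8859-1.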

CHARACTER SUBSTITUTION

       catdoc converts an MS-Word file into the following internal Unicode
       representation:

       1. Paragraphs are separated by the ASCII Line Feed symbol (0x000A).

       2. Table cells within a row are separated by the ASCII Field Separator
          symbol (0x001C).

       3. Table rows are separated by the ASCII Record Separator symbol
          (0x001E).

       4. All printable characters, including whitespace, are represented by
          their respective UNICODE codes.

       This UNICODE representation is subsequently converted into 8-bit text
       in the target character set using the following four-step algorithm:

       1. The list of special characters is searched for the given Unicode
          character. If found, the appropriate multi-character sequence is
          output instead of the character.

       2. If there is an equivalent in the target character set, it is output.

       3. Otherwise, the replacement list is searched and, if there is a
          multi-character substitution for this UNICODE char, it is output.

       4. If all of the above fails, the "Unknown char" symbol (question mark)
          is output.

       The lists of special characters and of substitutions are character
       set-independent, because special chars should be escaped regardless of
       their existence in the target character set (usually, they are part of
       US-ASCII, and therefore exist in any character set), and the
       replacement list is searched only for those characters that are not
       found in the target character set.

       These lists are stored in the catdoc library directory in files whose
       names are prefixed with the format name. These files have the following
       format:

       Each line can be either a comment (starting with a hash mark) or
       contain a hexadecimal UNICODE value, separated by whitespace from the
       string that is substituted for it. If the string contains no
       whitespace, it can be used as is; otherwise it should be enclosed in
       single or double quotes. The usual backslash sequences such as '\n' and
       '\t' can be used in these strings.
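
       The map file format and the four-step conversion described in this
       section can be sketched together in Python. This is a minimal
       illustration, not catdoc's implementation: parse_map, unescape and
       convert are hypothetical names, the sample maps are invented, the set
       of recognized backslash sequences is an assumption, and substitution
       strings are assumed to be plain ASCII.

```python
# Assumption: only these backslash sequences are recognized; any other
# escaped character stands for itself.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "\\": "\\", "'": "'", '"': '"'}


def unescape(s):
    """Expand backslash sequences in a substitution string."""
    out, i = [], 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(ESCAPES.get(s[i + 1], s[i + 1]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def parse_map(text):
    """Parse a special-character or replacement map into {code: string}."""
    table = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # assumption: skip lines without a substitution
        code, value = parts[0], parts[1].strip()
        # Strings containing whitespace are enclosed in matching quotes.
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            value = value[1:-1]
        table[int(code, 16)] = unescape(value)
    return table


def convert(text, special, charset, replacement, unknown="?"):
    """Apply the four-step algorithm; `charset` maps Unicode to 8-bit codes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in special:                    # step 1: special characters
            out.append(special[cp].encode("ascii"))
        elif cp in charset:                  # step 2: direct equivalent
            out.append(bytes([charset[cp]]))
        elif cp in replacement:              # step 3: replacement list
            out.append(replacement[cp].encode("ascii"))
        else:                                # step 4: unknown character
            out.append(unknown.encode("ascii"))
    return b"".join(out)


SPECIAL_MAP = r"""
# characters that must be escaped in the output format
0x0025 \\%
"""
REPLACEMENT_MAP = r"""
# stand-ins for characters missing from the target charset
0x00A9 (c)
0x2026 ...
0x2022 " * "
"""
ASCII_CHARSET = {cp: cp for cp in range(128)}  # identity map for US-ASCII

RESULT = convert("50% \u00a9 A\u2026\u4e2d", parse_map(SPECIAL_MAP),
                 ASCII_CHARSET, parse_map(REPLACEMENT_MAP))
```

       In this example the percent sign is escaped in step 1 even though it
       exists in the target charset, the copyright sign and the ellipsis fall
       through to the replacement list in step 3, and the last character ends
       up as a question mark in step 4.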

RUNTIME CONFIGURATION

       Upon startup, catdoc reads its system-wide configuration file (catdocrc
       in the catdoc library directory) and then the user-specific
       configuration file ${HOME}/.catdocrc.

       These files can contain the following directives:

       source_charset = charset-name
               Sets the default source charset, which is used if no -s option
               is specified. Consult the configuration of a nearby Windows
               workstation to find the one you need.

       target_charset = charset-name
               Sets the default output charset. You probably know which one
               you use.

       charset_path = directory-list
               Colon-separated list of directories that are searched for
               charset files. This allows you to install additional charsets
               in your home directory. If the first directory component of a
               path is ~, it is replaced by the contents of the HOME
               environment variable. On the MS-DOS platform, if a directory
               name starts with %s, it is replaced with the directory of the
               executable file. An empty element in the list (i.e. two
               consecutive colons) is considered the current directory.

       map_path = directory-list
               Colon-separated list of directories that are searched for the
               special character map and the replacement map. The same
               substitution rules as in charset_path are applied.

       format = format name
               Output format that is used by default. catdoc comes with two
               formats, ascii and tex, but nothing prevents you from writing
               your own format (a set of two map files: a special character
               map and a replacement map).

       unknown_char = character specification
               Sets the character to output instead of an unknown Unicode
               character (default '?'). The character specification can have
               one of two forms: a character enclosed in single quotes or a
               hexadecimal code.

       use_locale = (yes|no)
               Enables or disables automatic selection of the output charset
               (default yes), based on system locale settings (if enabled at
               compile time). If automatic detection is enabled, then output
               charset settings in the configuration files (but not on the
               command line) are ignored, and the current system locale
               charset is used instead. There is no automatic choice of input
               charset based on the locale language, because most modern Word
               files (since Word 97) are Unicode anyway.
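
       As an illustration of the directives above, the following Python sketch
       parses "name = value" lines and expands a directory list according to
       the ~ and empty-element rules. It is not catdoc's parser: parse_config
       and expand_path_list are hypothetical names, the sample values are
       invented, treating '#' as a comment marker in configuration files is an
       assumption, and the MS-DOS %s rule is left out.

```python
import os


def parse_config(text):
    """Parse configuration text into a {directive: value} dictionary."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' is a comment
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def expand_path_list(value, home=None):
    """Expand a colon-separated directory list (charset_path, map_path)."""
    if home is None:
        home = os.environ.get("HOME", "")
    dirs = []
    for item in value.split(":"):
        if item == "":
            dirs.append(".")                  # empty element: current directory
        elif item == "~" or item.startswith("~/"):
            dirs.append(home + item[1:])      # leading ~ component: HOME
        else:
            dirs.append(item)
    return dirs


SAMPLE = """\
source_charset = cp1251
target_charset = koi8-r
charset_path = ~/.catdoc::/usr/lib64/catdoc
use_locale = no
"""

SETTINGS = parse_config(SAMPLE)
SEARCH_DIRS = expand_path_list(SETTINGS["charset_path"], home="/home/user")
```

       Here SEARCH_DIRS lists the user's own charset directory first, then the
       current directory (from the two consecutive colons), then the system
       library directory.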

BUGS

       Doesn't handle fast-saves properly. Prints footnotes as separate
       paragraphs at the end of the file, instead of producing correct LaTeX
       commands. Cannot distinguish between an empty table cell and the end of
       a table row.

SEE ALSO

       xls2csv(1), catppt(1), cat(1), strings(1), utf(4), unicode(4)

AUTHOR

       V.B.Wagner <vitus@45.free.net>



MS-Word reader             Version @catdoc_version@                  catdoc(1)