catdoc(1) General Commands Manual catdoc(1)

NAME
catdoc - reads an MS-Word file and writes its content as plain text to
standard output

SYNOPSIS
catdoc [-vlu8btawxV] [-m number] [-s charset] [-d charset]
[-f output-format] file

DESCRIPTION
catdoc behaves much like cat(1), but it reads an MS-Word file and
produces human-readable text on standard output. Optionally it can use
latex(1) escape sequences for characters which have special meaning for
LaTeX. It also makes some effort to recognize MS-Word tables, although
it never tries to write correct headers for the LaTeX tabular
environment. Additional output formats, such as HTML, can be easily
defined.

catdoc doesn't attempt to extract formatting information other than
tables from an MS-Word document, so different output modes mainly mean
that different characters are escaped and different ways are used to
represent characters missing from the output charset. See CHARACTER
SUBSTITUTION below.

catdoc uses an internal unicode(4) representation of text, so it is
able to convert text when the charset of the source document doesn't
match the charset of the target system. See CHARACTER SETS below.

If no file names are supplied, catdoc processes its standard input
unless it is a terminal. It is unlikely that somebody would type a Word
document from the keyboard, so if catdoc is invoked without arguments
and stdin is not redirected, it prints a brief usage message and exits.
Processing of standard input (even among other files) can be forced by
using a dash '-' as the file name.

By default, catdoc wraps lines which are more than 72 chars long and
separates paragraphs by blank lines. This behavior can be turned off
with the -w switch. In wide mode catdoc prints each paragraph as one
long line, suitable for import into word processors which perform word
wrapping themselves.
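
The two output layouts described above can be approximated with the
following sketch. It is an illustration only, not catdoc's own code: the
function name format_paragraphs is hypothetical, and it assumes that in
wide mode paragraphs are simply emitted one per line.

```python
import textwrap


def format_paragraphs(paragraphs, margin=72):
    """Lay out paragraphs roughly the way the text above describes.

    margin=0 mimics wide mode (-w): one long line per paragraph.
    Any other margin wraps lines and puts a blank line between paragraphs.
    """
    if margin == 0:
        # Wide mode: no wrapping at all (assumed: one paragraph per line).
        return "\n".join(paragraphs) + "\n"
    # Default mode: wrap each paragraph so no line exceeds the margin.
    wrapped = [textwrap.fill(p, width=margin) for p in paragraphs]
    return "\n\n".join(wrapped) + "\n"
```

Calling format_paragraphs(paras) gives the default layout, while
format_paragraphs(paras, margin=0) gives the layout a word processor
would expect on import.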

OPTIONS
-a     Shortcut for -f ascii. Produces ASCII text as output. Separates
       table columns with TAB.

-b     Process broken MS-Word file. Normally, catdoc checks whether the
       first 8 bytes of the file are the Microsoft OLE signature. If
       so, it processes the file; otherwise it just copies it to
       stdout. This is intended for using catdoc as a filter for
       viewing all files with the .doc extension.

-d charset
       Specifies the destination charset name. The charset file has the
       format described in CHARACTER SETS below, should have a .txt
       extension, and should reside in the catdoc library directory
       (/usr/lib64/catdoc). By default, the current locale charset is
       used if langinfo support is compiled in.

-f format
       Specifies the output format as described in CHARACTER
       SUBSTITUTION below. catdoc comes with two output formats, ascii
       and tex. You can add your own if you wish.

-l     Causes catdoc to list the names of available charsets on stdout
       and exit successfully.

-m number
       Specifies the right margin for text (default 72). -m 0 is
       equivalent to -w.

-s charset
       Specifies the source charset (the one used in the Word
       document), if the Word document doesn't contain UTF-16 text.
       When reading RTF documents, this is typically not necessary,
       because RTF documents contain an ansicpg specification. But it
       can be set wrongly by Word (I've seen RTF documents in Russian
       where cp1252 was specified). In this case this option takes
       precedence over the charset specified in the document. The
       source_charset statement in the configuration file, however,
       has lower priority than the charset in the document.

-t     Shortcut for -f tex. Converts all printable chars which have
       special meaning for LaTeX(1) into appropriate control sequences.
       Separates table columns by &.

-u     Declares that the Word document contains a UNICODE (UTF-16)
       representation of text (as some Word-97 documents do). If catdoc
       fails to convert a Word document correctly with the default
       charset, try this option.

-8     Declares that the Word document is 8-bit, just in case catdoc
       recognizes the file format incorrectly.

-w     Disables word wrapping. By default catdoc output is split into
       lines not longer than 72 characters (or the number specified by
       the -m option) and paragraphs are separated by a blank line.
       With this option each paragraph is one long line.

-x     Causes catdoc to output unknown UNICODE characters as \xNNNN
       instead of question marks.

-v     Causes catdoc to print some useless information about the Word
       document structure to stdout before the actual start of text.

-V     Outputs the catdoc version.
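
The signature check mentioned under -b can be pictured with the sketch
below. The byte values are not given in this manual; they are the
well-known magic number at the start of OLE compound files, and the
function name looks_like_ole is hypothetical. This is not catdoc's own
code.

```python
# Magic number at the start of an OLE compound file (an assumption drawn
# from the compound file format, not from this manual page).
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if data begins with the 8-byte OLE signature."""
    return data[:8] == OLE_SIGNATURE
```

A filter built on this idea would parse the input when looks_like_ole()
returns True and pass the bytes through unchanged otherwise.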

CHARACTER SETS
When processing an MS-Word file, catdoc uses information about two
character sets, typically different: input and output. They are stored
in plain text files in the catdoc library directory. Each line of a
character set file should contain two whitespace-separated hexadecimal
numbers: the 8-bit code in the character set and the 16-bit Unicode
code. Anything from a hash mark to the end of the line is ignored, as
are blank lines.
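
A reader for this file format might look like the sketch below. It
follows only the rules stated above; the function name parse_charset is
hypothetical, and skipping lines with fewer than two fields is an
assumption made so that incomplete entries do not raise an error.

```python
def parse_charset(text):
    """Map 8-bit codes to Unicode code points from charset-file text."""
    table = {}
    for line in text.splitlines():
        # Anything from a hash mark to the end of the line is ignored.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # assumed: entries without a Unicode code are skipped
        # int(..., 16) accepts both "C0" and "0xC0".
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table
```

The resulting dictionary can be used directly for the input direction
and inverted for the output direction.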

The catdoc distribution includes some of these character sets.
Additional character set definitions, directly usable by catdoc, can be
obtained from ftp.unicode.org. Charset files have a .txt suffix, which
shouldn't be specified on the command line or in configuration files.

Note that catdoc is distributed with Cyrillic charsets as the default.
If you are not Russian, you probably don't want that, and should
reconfigure catdoc at compile time or in the runtime configuration
file.

When dealing with documents with charsets other than the default,
remember that Microsoft never uses ISO charsets. While letters in, say,
cp1252 are at the same positions as in ISO-8859-1, some punctuation
signs would be lost if you specified ISO-8859-1 as the input charset.
If you use cp1252, catdoc deals with those signs as described in
CHARACTER SUBSTITUTION below.

CHARACTER SUBSTITUTION
catdoc converts an MS-Word file into the following internal Unicode
representation:

1. Paragraphs are separated by the ASCII Line Feed symbol (0x000A).

2. Table cells within a row are separated by the ASCII File Separator
   (FS) symbol (0x001C).

3. Table rows are separated by the ASCII Record Separator symbol
   (0x001E).

4. All printable characters, including whitespace, are represented
   with their respective UNICODE codes.
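
The separators listed above make it straightforward to recover table
structure from the internal text. The sketch below is illustrative only:
the function names are hypothetical, and it assumes that a trailing
row separator may follow the last row.

```python
PARA_SEP = "\n"    # 0x000A, between paragraphs
CELL_SEP = "\x1c"  # 0x001C, between cells of one row
ROW_SEP = "\x1e"   # 0x001E, between rows


def split_rows(text):
    """Split a run of table text into a list of rows of cells."""
    rows = text.split(ROW_SEP)
    if rows and rows[-1] == "":
        rows.pop()  # assumed: ignore the empty piece after a trailing separator
    return [row.split(CELL_SEP) for row in rows]


def render_rows(rows, column_sep="\t"):
    """Join cells with column_sep and put each row on its own line."""
    return PARA_SEP.join(column_sep.join(cells) for cells in rows)
```

With column_sep="\t" this resembles the column separation described for
the ascii format, and with " & " it resembles the tex format.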

This UNICODE representation is subsequently converted into 8-bit text
in the target character set using the following four-step algorithm:

1. The list of special characters is searched for the given Unicode
   character. If it is found, the appropriate multi-character sequence
   is output instead of the character.

2. If there is an equivalent in the target character set, it is output.

3. Otherwise, the replacement list is searched and, if there is a
   multi-character substitution for this UNICODE char, it is output.

4. If all of the above fail, the "Unknown char" symbol (question mark)
   is output.

The list of special characters and the list of substitutions are
character set-independent: special chars should be escaped regardless
of their existence in the target character set (usually they are part
of US-ASCII and therefore exist in any character set), and the
replacement list is searched only for those characters which are not
found in the target character set.
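
The four steps can be sketched as follows. This is a minimal
illustration under stated assumptions, not catdoc's implementation: the
function name convert and the dictionary layout (code point to bytes for
the two lists, code point to 8-bit code for the target charset) are
hypothetical.

```python
def convert(text, special, target, replacement, unknown=b"?"):
    """Convert Unicode text to 8-bit output using the four-step lookup."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in special:          # step 1: special characters are escaped
            out += special[cp]
        elif cp in target:         # step 2: direct equivalent in target charset
            out.append(target[cp])
        elif cp in replacement:    # step 3: multi-character substitution
            out += replacement[cp]
        else:                      # step 4: unknown character symbol
            out += unknown
    return bytes(out)
```

Because the special list is consulted first, a character such as & is
escaped even though it exists in the target charset, which matches the
explanation above.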

These lists are stored in the catdoc library directory in files whose
names are prefixed with the format name. These files have the following
format:

Each line is either a comment (starting with a hash mark) or contains a
hexadecimal UNICODE value, separated by whitespace from the string
which is substituted for it. If the string contains no whitespace it
can be used as is; otherwise it should be enclosed in single or double
quotes. The usual backslash sequences like '\n' and '\t' can be used in
these strings.
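
A parser for such a map file could be sketched like this. The function
names are hypothetical, and two details are assumptions rather than
documented behavior: only \n, \t and \\ are translated, and any other
backslash sequence is kept verbatim.

```python
import re

# Escapes handled here; the exact set catdoc accepts is not stated above.
_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\"}


def _unescape(value):
    """Translate known backslash sequences, leave unknown ones untouched."""
    return re.sub(
        r"\\(.)",
        lambda m: _ESCAPES.get(m.group(1), "\\" + m.group(1)),
        value,
    )


def parse_map(text):
    """Map Unicode code points to substitution strings."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # no substitution string on this line
        code, value = parts
        # Strip one pair of matching single or double quotes, if present.
        if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
            value = value[1:-1]
        table[int(code, 16)] = _unescape(value)
    return table
```

The same function can serve for both the special character map and the
replacement map, since the text above describes one format for both.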

RUNTIME CONFIGURATION
Upon startup catdoc reads its system-wide configuration file (catdocrc
in the catdoc library directory) and then the user-specific
configuration file ${HOME}/.catdocrc.

These files can contain the following directives:

source_charset = charset-name
       Sets the default source charset, which is used if no -s option
       is specified. Consult the configuration of a nearby Windows
       workstation to find the one you need.

target_charset = charset-name
       Sets the default output charset. You probably know which one you
       use.

charset_path = directory-list
       Colon-separated list of directories which are searched for
       charset files. This allows you to install additional charsets in
       your home directory. If the first directory component of a path
       is ~, it is replaced by the contents of the HOME environment
       variable. On the MS-DOS platform, if a directory name starts
       with %s, it is replaced with the directory of the executable
       file. An empty element in the list (i.e. two consecutive colons)
       is treated as the current directory.

map_path = directory-list
       Colon-separated list of directories which are searched for the
       special character map and the replacement map. The same
       substitution rules as in charset_path are applied.

format = format name
       Output format which is used by default. catdoc comes with two
       formats, ascii and tex, but nothing prevents you from writing
       your own format (a set of two map files: a special character map
       and a replacement map).

unknown_char = character specification
       Sets the character to output instead of an unknown Unicode
       character (default '?'). The character specification can have
       one of two forms: a character enclosed in single quotes or a
       hexadecimal code.

use_locale = (yes|no)
       Enables or disables automatic selection of the output charset
       (default yes), based on system locale settings (if enabled at
       compile time). If automatic detection is enabled, then output
       charset settings in the configuration files (but not on the
       command line) are ignored, and the current system locale charset
       is used instead. There is no automatic choice of input charset
       based on locale language, because most modern Word files (since
       Word 97) are Unicode anyway.
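
The directive syntax and the path rules can be illustrated with the
sketch below. It is not catdoc's parser: the function names are
hypothetical, the MS-DOS %s rule is left out, and the charset names and
directories in the usage note are placeholders.

```python
def parse_directives(text):
    """Collect 'name = value' lines into a dictionary."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # not a directive
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def expand_path_list(value, home):
    """Expand a colon-separated directory list as described above."""
    dirs = []
    for item in value.split(":"):
        if item == "":
            dirs.append(".")              # empty element: current directory
        elif item == "~" or item.startswith("~/"):
            dirs.append(home + item[1:])  # leading ~ becomes HOME
        else:
            dirs.append(item)
    return dirs
```

For example, a charset_path value of ~/charsets::/usr/lib64/catdoc with
HOME set to /home/user would expand to /home/user/charsets, the current
directory, and /usr/lib64/catdoc, searched in that order.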

BUGS
Doesn't handle fast-saves properly. Prints footnotes as separate
paragraphs at the end of file, instead of producing correct LaTeX
commands. Cannot distinguish between empty table cell and end of table
row.

SEE ALSO
xls2csv(1), cat(1), strings(1), utf(4), unicode(4)

AUTHOR
V.B.Wagner <vitus@45.free.net>

MS-Word reader Version 0.94.2 catdoc(1)