TESSERACT(1)                                                      TESSERACT(1)

NAME
       tesseract - command-line OCR engine

SYNOPSIS
       tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...

DESCRIPTION
       tesseract(1) is a commercial quality OCR engine originally developed at
       HP between 1985 and 1995. In 1995, this engine was among the top 3
       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has
       been developed at Google since then.

IN/OUT ARGUMENTS
       FILE
           The name of the input file. This can either be an image file or a
           text file.

           Most image file formats (anything readable by Leptonica) are
           supported.

           A text file lists the names of all input images (one image name per
           line). The results will be combined in a single file for each
           output file format (txt, pdf, hocr, xml).

           If FILE is stdin or - then the standard input is used.

       OUTPUTBASE
           The basename of the output file (to which the appropriate extension
           will be appended). By default the output will be a text file with
           .txt added to the basename unless there are one or more parameters
           set which explicitly specify the desired output.

           If OUTPUTBASE is stdout or - then the standard output is used.
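
       The two input forms described above can be sketched as follows. This
       is a minimal illustration, not a recommended workflow: page1.png,
       page2.png and the output basename book are hypothetical names, and
       the tesseract calls only run if the program and the images are
       actually present.

```shell
#!/bin/sh
# Build a list file naming the input images, one image name per line.
# page1.png and page2.png are hypothetical file names.
workdir=$(mktemp -d)
printf '%s\n' page1.png page2.png > "$workdir/pages.txt"
line_count=$(wc -l < "$workdir/pages.txt" | tr -d ' ')
echo "list file has $line_count entries"

# Run only if tesseract is installed and the images exist.
if command -v tesseract >/dev/null 2>&1 && [ -f page1.png ] && [ -f page2.png ]; then
    # The text of both pages is combined in a single file, book.txt.
    tesseract "$workdir/pages.txt" book
    # Read one image from standard input, write the text to standard output.
    tesseract stdin stdout < page1.png
fi

rm -rf "$workdir"
```

       Using - in place of stdin or stdout is equivalent, according to the
       argument descriptions above.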

OPTIONS
       -c CONFIGVAR=VALUE
           Set value for parameter CONFIGVAR to VALUE. Multiple -c arguments
           are allowed.

       --dpi N
           Specify the resolution N in DPI for the input image(s). A typical
           value for N is 300. Without this option, the resolution is read
           from the metadata included in the image. If an image does not
           include that information, Tesseract tries to guess it.

       -l LANG, -l SCRIPT
           The language or script to use. If none is specified, eng (English)
           is assumed. Multiple languages may be specified, separated by plus
           characters. Tesseract uses 3-character ISO 639-2 language codes
           (see LANGUAGES AND SCRIPTS).
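
       As a small sketch of how these options combine, the following joins
       several language codes with plus characters and passes them together
       with --dpi and -c. The file name scan.png and the blacklisted
       character are assumptions made for illustration; the parameter
       tessedit_char_blacklist is the one quoted elsewhere in this page.

```shell
#!/bin/sh
# Join language codes with '+' as expected by -l.
langs="eng deu fra"
lang_arg=$(printf '%s' "$langs" | tr ' ' '+')
echo "$lang_arg"

# scan.png is a hypothetical input image.
if command -v tesseract >/dev/null 2>&1 && [ -f scan.png ]; then
    # Force 300 DPI and set one parameter on the command line.
    tesseract scan.png scan -l "$lang_arg" --dpi 300 \
        -c tessedit_char_blacklist='|'
fi
```

       Each language named this way needs a matching traineddata file to be
       installed.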

       --psm N
           Set Tesseract to only run a subset of layout analysis and assume a
           certain form of image. The options for N are:

                0 = Orientation and script detection (OSD) only.
                1 = Automatic page segmentation with OSD.
                2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
                3 = Fully automatic page segmentation, but no OSD. (Default)
                4 = Assume a single column of text of variable sizes.
                5 = Assume a single uniform block of vertically aligned text.
                6 = Assume a single uniform block of text.
                7 = Treat the image as a single text line.
                8 = Treat the image as a single word.
                9 = Treat the image as a single word in a circle.
               10 = Treat the image as a single character.
               11 = Sparse text. Find as much text as possible in no particular order.
               12 = Sparse text with OSD.
               13 = Raw line. Treat the image as a single text line,
                    bypassing hacks that are Tesseract-specific.

       --oem N
           Specify OCR Engine mode. The options for N are:

               0 = Original Tesseract only.
               1 = Neural nets LSTM only.
               2 = Tesseract + LSTM.
               3 = Default, based on what is available.
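
       One way to use the mode tables above from a script is to map a
       description of the image to a --psm value. The helper psm_for and the
       file name label.png are hypothetical; the numbers are taken from the
       list above.

```shell
#!/bin/sh
# Map a rough description of the image content to a page segmentation mode.
# psm_for is a hypothetical helper, not part of Tesseract.
psm_for() {
    case "$1" in
        block)  echo 6 ;;   # single uniform block of text
        line)   echo 7 ;;   # single text line
        word)   echo 8 ;;   # single word
        char)   echo 10 ;;  # single character
        sparse) echo 11 ;;  # sparse text
        *)      echo 3 ;;   # default: fully automatic, no OSD
    esac
}

echo "psm for a text line: $(psm_for line)"

# label.png is a hypothetical image containing one line of text.
if command -v tesseract >/dev/null 2>&1 && [ -f label.png ]; then
    # Recognize a single line with the LSTM engine only.
    tesseract label.png stdout --psm "$(psm_for line)" --oem 1
fi
```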

       --tessdata-dir PATH
           Specify the location of the tessdata directory.

       --user-patterns FILE
           Specify the location of the user patterns file.

       --user-words FILE
           Specify the location of the user words file.

       CONFIGFILE
           The name of a config to use. The name can be a file in
           tessdata/configs or tessdata/tessconfigs, or an absolute or
           relative file path. A config is a plain text file which contains a
           list of parameters and their values, one per line, with a space
           separating parameter from value.

           Interesting config files include:

           •   alto - Output in ALTO format (OUTPUTBASE.xml).

           •   hocr - Output in hOCR format (OUTPUTBASE.hocr).

           •   pdf - Output PDF (OUTPUTBASE.pdf).

           •   tsv - Output TSV (OUTPUTBASE.tsv).

           •   txt - Output plain text (OUTPUTBASE.txt).

           •   get.images - Write processed input images to file
               (OUTPUTBASE.processedPAGENUMBER.tif).

           •   logfile - Redirect debug messages to file (tesseract.log).

           •   lstm.train - Output files used by LSTM training
               (OUTPUTBASE.lstmf).

           •   makebox - Write box file (OUTPUTBASE.box).

           •   quiet - Redirect debug messages to /dev/null.

       It is possible to select several config files. For example, tesseract
       image.png demo alto hocr pdf txt will create four output files,
       demo.xml, demo.hocr, demo.pdf and demo.txt, with the OCR results.

       Nota bene: The options -l LANG, -l SCRIPT and --psm N must occur before
       any CONFIGFILE.
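
       The ordering rule can be illustrated with a small helper that always
       places the options before the config names. build_cmd and the file
       names are hypothetical, and the language and page segmentation mode
       are fixed here only to keep the sketch short.

```shell
#!/bin/sh
# Assemble a command line with options first and config names last.
# build_cmd is a hypothetical helper used only for this illustration.
build_cmd() {
    image=$1
    base=$2
    shift 2
    echo "tesseract $image $base -l eng --psm 3 $*"
}

cmd=$(build_cmd image.png demo hocr pdf txt)
echo "$cmd"

# image.png is a hypothetical input image.
if command -v tesseract >/dev/null 2>&1 && [ -f image.png ]; then
    # One output file per config: demo.hocr, demo.pdf and demo.txt.
    tesseract image.png demo -l eng --psm 3 hocr pdf txt
    ls demo.hocr demo.pdf demo.txt
fi
```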

SINGLE OPTIONS
       -h, --help
           Show help message.

       --help-extra
           Show extra help for advanced users.

       --help-psm
           Show page segmentation modes.

       --help-oem
           Show OCR Engine modes.

       -v, --version
           Returns the current version of the tesseract(1) executable.

       --list-langs
           List available languages for tesseract engine. Can be used with
           --tessdata-dir PATH.

       --print-parameters
           Print tesseract parameters.
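
       A script can use --list-langs to check for a language before starting
       a long job. The sketch below assumes that the listing contains one
       language name per line; the header line in the sample is only
       illustrative, and lang_in_list is a hypothetical helper.

```shell
#!/bin/sh
# Succeed if the language code $1 appears as a whole line on standard input.
# lang_in_list is a hypothetical helper, not part of Tesseract.
lang_in_list() {
    grep -qx -- "$1"
}

# Illustrative listing; the exact header text is an assumption.
sample='List of available languages (2):
eng
osd'

if printf '%s\n' "$sample" | lang_in_list eng; then
    echo "eng found in sample listing"
fi

if command -v tesseract >/dev/null 2>&1; then
    # Merge stderr into stdout so the check does not depend on which
    # stream the listing is written to.
    if ! tesseract --list-langs 2>&1 | lang_in_list deu; then
        echo "deu is not installed" >&2
    fi
fi
```

       Matching whole lines avoids false positives such as ita matching
       ita_old.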

LANGUAGES AND SCRIPTS
       To recognize some text with Tesseract, it is normally necessary to
       specify the language(s) or script(s) of the text (unless it is English
       text which is supported by default) using -l LANG or -l SCRIPT.

       Selecting a language automatically also selects the language specific
       character set and dictionary (word list).

       Selecting a script typically selects all characters of that script
       which can be from different languages. The dictionary which is included
       also contains a mix from different languages. In most cases, a script
       also supports English. So it is possible to recognize a language that
       has not been specifically trained for by using traineddata for the
       script it is written in.

       More than one language or script may be specified by using +. Example:
       tesseract myimage.png myimage -l eng+deu+fra.

       https://github.com/tesseract-ocr/tessdata_fast provides fast language
       and script models which are also part of Linux distributions.

       For Tesseract 4, tessdata_fast includes traineddata files for the
       following languages:

       afr (Afrikaans), amh (Amharic), ara (Arabic), asm (Assamese), aze
       (Azerbaijani), aze_cyrl (Azerbaijani - Cyrillic), bel (Belarusian), ben
       (Bengali), bod (Tibetan), bos (Bosnian), bre (Breton), bul (Bulgarian),
       cat (Catalan; Valencian), ceb (Cebuano), ces (Czech), chi_sim (Chinese
       simplified), chi_tra (Chinese traditional), chr (Cherokee), cos
       (Corsican), cym (Welsh), dan (Danish), deu (German), div (Dhivehi), dzo
       (Dzongkha), ell (Greek, Modern, 1453-), eng (English), enm (English,
       Middle, 1100-1500), epo (Esperanto), equ (Math / equation detection
       module), est (Estonian), eus (Basque), fas (Persian), fao (Faroese),
       fil (Filipino), fin (Finnish), fra (French), frk (Frankish), frm
       (French, Middle, ca.1400-1600), fry (West Frisian), gla (Scottish
       Gaelic), gle (Irish), glg (Galician), grc (Greek, Ancient, to 1453),
       guj (Gujarati), hat (Haitian; Haitian Creole), heb (Hebrew), hin
       (Hindi), hrv (Croatian), hun (Hungarian), hye (Armenian), iku
       (Inuktitut), ind (Indonesian), isl (Icelandic), ita (Italian), ita_old
       (Italian - Old), jav (Javanese), jpn (Japanese), kan (Kannada), kat
       (Georgian), kat_old (Georgian - Old), kaz (Kazakh), khm (Central
       Khmer), kir (Kirghiz; Kyrgyz), kmr (Kurdish Kurmanji), kor (Korean),
       kor_vert (Korean vertical), lao (Lao), lat (Latin), lav (Latvian), lit
       (Lithuanian), ltz (Luxembourgish), mal (Malayalam), mar (Marathi), mkd
       (Macedonian), mlt (Maltese), mon (Mongolian), mri (Maori), msa (Malay),
       mya (Burmese), nep (Nepali), nld (Dutch; Flemish), nor (Norwegian), oci
       (Occitan post 1500), ori (Oriya), osd (Orientation and script detection
       module), pan (Panjabi; Punjabi), pol (Polish), por (Portuguese), pus
       (Pushto; Pashto), que (Quechua), ron (Romanian; Moldavian; Moldovan),
       rus (Russian), san (Sanskrit), sin (Sinhala; Sinhalese), slk (Slovak),
       slv (Slovenian), snd (Sindhi), spa (Spanish; Castilian), spa_old
       (Spanish; Castilian - Old), sqi (Albanian), srp (Serbian), srp_latn
       (Serbian - Latin), sun (Sundanese), swa (Swahili), swe (Swedish), syr
       (Syriac), tam (Tamil), tat (Tatar), tel (Telugu), tgk (Tajik), tha
       (Thai), tir (Tigrinya), ton (Tonga), tur (Turkish), uig (Uighur;
       Uyghur), ukr (Ukrainian), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek -
       Cyrillic), vie (Vietnamese), yid (Yiddish), yor (Yoruba)

       To use a non-standard language pack named foo.traineddata, set the
       TESSDATA_PREFIX environment variable so the file can be found at
       TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the
       argument -l foo.

       For Tesseract 4, tessdata_fast includes traineddata files for the
       following scripts:

       Arabic, Armenian, Bengali, Canadian_Aboriginal, Cherokee, Cyrillic,
       Devanagari, Ethiopic, Fraktur, Georgian, Greek, Gujarati, Gurmukhi,
       HanS (Han simplified), HanS_vert (Han simplified, vertical), HanT (Han
       traditional), HanT_vert (Han traditional, vertical), Hangul,
       Hangul_vert (Hangul vertical), Hebrew, Japanese, Japanese_vert
       (Japanese vertical), Kannada, Khmer, Lao, Latin, Malayalam, Myanmar,
       Oriya (Odia), Sinhala, Syriac, Tamil, Telugu, Thaana, Thai, Tibetan,
       Vietnamese.

       The same languages and scripts are available from
       https://github.com/tesseract-ocr/tessdata_best. tessdata_best provides
       slow language and script models. These models are needed for training.
       They can also give better OCR results, but the recognition takes much
       more time.

       Both tessdata_fast and tessdata_best only support the LSTM OCR engine.

       There is a third repository, https://github.com/tesseract-ocr/tessdata,
       with models which support both the Tesseract 3 legacy OCR engine and
       the Tesseract 4 LSTM OCR engine.

CONFIG FILES AND AUGMENTING WITH USER DATA
       Tesseract config files consist of lines with parameter-value pairs
       (space separated). The parameters are documented as flags in the source
       code like the following one in tesseractclass.h:

       STRING_VAR_H(tessedit_char_blacklist, "", "Blacklist of chars not to recognize");

       These parameters may enable or disable various features of the engine,
       and may cause it to load (or not load) various data. For instance,
       suppose you want to OCR in English, but suppress the normal dictionary
       and load an alternative word list and an alternative list of patterns.
       These two files are the most commonly used extra data files.

       If your language pack is in /path/to/eng.traineddata and the hocr
       config is in /path/to/configs/hocr then create three new files:

       /path/to/eng.user-words:

           the
           quick
           brown
           fox
           jumped

       /path/to/eng.user-patterns:

           1-\d\d\d-GOOG-411
           www.\n\\\*.com

       /path/to/configs/bazaar:

           load_system_dawg     F
           load_freq_dawg       F
           user_words_suffix    user-words
           user_patterns_suffix user-patterns

       Now, if you pass the word bazaar as a CONFIGFILE to Tesseract,
       Tesseract will load neither the system dictionary nor the dictionary
       of frequent words, and will load and use the eng.user-words and
       eng.user-patterns files you provided. The former is a simple word
       list, one word per line. The format of the latter is documented in
       dict/trie.h on read_pattern_list().
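
       The three files can be created from a script along the following
       lines. This is a sketch under assumptions: a temporary directory
       stands in for /path/to, image.png is a hypothetical input, and the
       final tesseract call only runs if an eng.traineddata file has been
       placed in that directory.

```shell
#!/bin/sh
# A temporary directory stands in for /path/to in this sketch.
tessdata=$(mktemp -d)
mkdir -p "$tessdata/configs"

# User word list: one word per line.
printf '%s\n' the quick brown fox jumped > "$tessdata/eng.user-words"

# User patterns, copied verbatim; the quoted delimiter keeps backslashes intact.
cat > "$tessdata/eng.user-patterns" <<'EOF'
1-\d\d\d-GOOG-411
www.\n\\\*.com
EOF

# Config file: one "parameter value" pair per line.
cat > "$tessdata/configs/bazaar" <<'EOF'
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
EOF

word_count=$(wc -l < "$tessdata/eng.user-words" | tr -d ' ')
config_lines=$(wc -l < "$tessdata/configs/bazaar" | tr -d ' ')
echo "wrote $word_count words and $config_lines config lines to $tessdata"

# Runs only when the language pack and a test image are available.
if command -v tesseract >/dev/null 2>&1 \
        && [ -f "$tessdata/eng.traineddata" ] && [ -f image.png ]; then
    tesseract image.png out --tessdata-dir "$tessdata" -l eng bazaar
fi
```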

ENVIRONMENT VARIABLES
       TESSDATA_PREFIX
           If the TESSDATA_PREFIX is set to a path, then that path is used to
           find the tessdata directory with language and script recognition
           models and config files. Using --tessdata-dir PATH is the
           recommended alternative.

       OMP_THREAD_LIMIT
           If the tesseract executable was built with multithreading support,
           it will normally use four CPU cores for the OCR process. While this
           can be faster for a single image, it gives bad performance if the
           host computer provides fewer than four CPU cores or if OCR is run
           on many images. Only a single CPU core is used with
           OMP_THREAD_LIMIT=1.
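
       When many images are processed, one possible arrangement is to limit
       each tesseract process to a single core and run several processes
       side by side. The sketch below is one such arrangement, not a tuned
       setup: the batch size of 4 and the *.png file pattern are arbitrary
       assumptions, and outbase is a hypothetical helper.

```shell
#!/bin/sh
# Derive the output basename by removing the last extension.
# outbase is a hypothetical helper, not part of Tesseract.
outbase() {
    printf '%s\n' "${1%.*}"
}

echo "output base for scans/page1.png: $(outbase scans/page1.png)"

# Arbitrary batch size for this sketch.
max_jobs=4

if command -v tesseract >/dev/null 2>&1; then
    # Restrict every tesseract process started below to one CPU core.
    export OMP_THREAD_LIMIT=1
    running=0
    for img in *.png; do
        [ -f "$img" ] || continue
        tesseract "$img" "$(outbase "$img")" &
        running=$((running + 1))
        # Wait for the current batch before starting the next one.
        if [ "$running" -ge "$max_jobs" ]; then
            wait
            running=0
        fi
    done
    wait
fi
```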

HISTORY
       The engine was developed at Hewlett Packard Laboratories Bristol and at
       Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
       more changes made in 1996 to port to Windows, and some C++izing in
       1998. A lot of the code was written in C, and then some more was
       written in C++. The C++ code makes heavy use of a list system using
       macros. This predates STL, was portable before STL, and is more
       efficient than STL lists, but has the big drawback that a segmentation
       violation is hard to debug.

       Version 2.00 brought Unicode (UTF-8) support, six languages, and the
       ability to train Tesseract.

       Tesseract was included in UNLV's Fourth Annual Test of OCR Accuracy.
       See https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf. Since
       Tesseract 2.00, scripts have been included to allow anyone to reproduce
       some of these tests. See
       https://tesseract-ocr.github.io/tessdoc/TestingTesseract.html for more
       details.

       Tesseract 3.00 added a number of new languages, including Chinese,
       Japanese, and Korean. It also introduced a new, single-file based
       system of managing language data.

       Tesseract 3.02 added bidirectional text support, the ability to
       recognize multiple languages in a single image, and improved layout
       analysis.

       Tesseract 4 adds a new neural net (LSTM) based OCR engine which is
       focused on line recognition, but also still supports the legacy
       Tesseract OCR engine of Tesseract 3 which works by recognizing
       character patterns. Compatibility with Tesseract 3 is enabled by
       --oem 0. This also needs traineddata files which support the legacy
       engine, for example those from the tessdata repository
       (https://github.com/tesseract-ocr/tessdata).

       For further details, see the release notes in the Tesseract
       documentation
       (https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html).

RESOURCES
       Main web site: https://github.com/tesseract-ocr
       User forum: https://groups.google.com/g/tesseract-ocr
       Documentation: https://tesseract-ocr.github.io/
       Information on training:
       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

SEE ALSO
       ambiguous_words(1), cntraining(1), combine_tessdata(1),
       dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5),
       unicharset(5), unicharset_extractor(1), wordlist2dawg(1)

AUTHOR
       Tesseract development was led at Hewlett-Packard and Google by Ray
       Smith. The development team has included:

       Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David
       Eger, Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern
       Wanke, Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil
       Cheatle, Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith
       Unnikrishnan, Raquel Romano, Ray Smith, Rika Antonova, Robert Moss,
       Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.

       For a list of contributors see
       https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS.

COPYING
       Licensed under the Apache License, Version 2.0

                                  02/25/2022                      TESSERACT(1)