TESSERACT(1)                                                      TESSERACT(1)

NAME

       tesseract - command-line OCR engine

SYNOPSIS

       tesseract FILE OUTPUTBASE [OPTIONS]... [CONFIGFILE]...

DESCRIPTION

       tesseract(1) is a commercial-quality OCR engine originally developed at
       HP between 1985 and 1995. In 1995, this engine was among the top three
       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has
       been developed at Google since then.

IN/OUT ARGUMENTS

       FILE
           The name of the input file. This can either be an image file or a
           text file.

           Most image file formats (anything readable by Leptonica) are
           supported.

           A text file lists the names of all input images (one image name per
           line). The results will be combined in a single file for each
           output file format (txt, pdf, hocr, xml).

           If FILE is stdin or - then the standard input is used.

       OUTPUTBASE
           The basename of the output file (to which the appropriate extension
           will be appended). By default the output will be a text file with
           .txt added to the basename unless there are one or more parameters
           set which explicitly specify the desired output.

           If OUTPUTBASE is stdout or - then the standard output is used.
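
       The two special forms of FILE described above (a text file listing
       images, and standard input) can be sketched as follows. The names
       page1.png, page2.png, images.txt and book are placeholders, not part
       of Tesseract, and the commands only run when tesseract and the
       placeholder images are actually present.

```shell
# Placeholder names: page1.png, page2.png and book are assumptions.
workdir=$(mktemp -d)
list="$workdir/images.txt"

# A text-file FILE argument holds one image name per line.
printf '%s\n' page1.png page2.png > "$list"

if command -v tesseract >/dev/null 2>&1 && [ -f page1.png ] && [ -f page2.png ]; then
    # The results for both pages are combined in a single book.txt.
    tesseract "$list" book

    # Read the image from standard input, write the text to standard output.
    tesseract - - < page1.png
fi
```

       Because both FILE and OUTPUTBASE accept -, the second form fits into
       a shell pipeline without temporary files.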

OPTIONS

       -c CONFIGVAR=VALUE
           Set value for parameter CONFIGVAR to VALUE. Multiple -c arguments
           are allowed.

       --dpi N
           Specify the resolution N in DPI for the input image(s). A typical
           value for N is 300. Without this option, the resolution is read
           from the metadata included in the image. If an image does not
           include that information, Tesseract tries to guess it.

       -l LANG, -l SCRIPT
           The language or script to use. If none is specified, eng (English)
           is assumed. Multiple languages may be specified, separated by plus
           characters. Tesseract uses 3-character ISO 639-2 language codes
           (see LANGUAGES AND SCRIPTS).

       --psm N
           Set Tesseract to only run a subset of layout analysis and assume a
           certain form of image. The options for N are:

               0 = Orientation and script detection (OSD) only.
               1 = Automatic page segmentation with OSD.
               2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
               3 = Fully automatic page segmentation, but no OSD. (Default)
               4 = Assume a single column of text of variable sizes.
               5 = Assume a single uniform block of vertically aligned text.
               6 = Assume a single uniform block of text.
               7 = Treat the image as a single text line.
               8 = Treat the image as a single word.
               9 = Treat the image as a single word in a circle.
               10 = Treat the image as a single character.
               11 = Sparse text. Find as much text as possible in no particular order.
               12 = Sparse text with OSD.
               13 = Raw line. Treat the image as a single text line,
                    bypassing hacks that are Tesseract-specific.

       --oem N
           Specify OCR Engine mode. The options for N are:

               0 = Original Tesseract only.
               1 = Neural nets LSTM only.
               2 = Tesseract + LSTM.
               3 = Default, based on what is available.

       --tessdata-dir PATH
           Specify the location of tessdata path.

       --user-patterns FILE
           Specify the location of user patterns file.

       --user-words FILE
           Specify the location of user words file.

       CONFIGFILE
           The name of a config to use. The name can be a file in
           tessdata/configs or tessdata/tessconfigs, or an absolute or
           relative file path. A config is a plain text file which contains a
           list of parameters and their values, one per line, with a space
           separating parameter from value.

           Interesting config files include:

           alto - Output in ALTO format (OUTPUTBASE.xml).

           hocr - Output in hOCR format (OUTPUTBASE.hocr).

           pdf - Output PDF (OUTPUTBASE.pdf).

           tsv - Output TSV (OUTPUTBASE.tsv).

           txt - Output plain text (OUTPUTBASE.txt).

           get.images - Write processed input images to file
               (OUTPUTBASE.processedPAGENUMBER.tif).

           logfile - Redirect debug messages to file (tesseract.log).

           lstm.train - Output files used by LSTM training
               (OUTPUTBASE.lstmf).

           makebox - Write box file (OUTPUTBASE.box).

           quiet - Redirect debug messages to /dev/null.

       Several config files can be selected at once. For example, tesseract
       image.png demo alto hocr pdf txt creates four output files, demo.xml,
       demo.hocr, demo.pdf and demo.txt, with the OCR results.

       Nota bene: The options -l LANG, -l SCRIPT and --psm N must occur before
       any CONFIGFILE.
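
       As an illustration of the ordering rule above, the following sketch
       places every option before the config file names. The file names
       image.png and demo are placeholders, and the deu language pack is
       assumed to be installed; the command is skipped when tesseract or the
       image is missing.

```shell
# Placeholder input image and output basename (assumptions for this sketch).
image=image.png
out=demo

if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
    # Options first (--dpi, -l, --psm, -c), config files (hocr, txt) last.
    # tessedit_char_blacklist is the parameter quoted later in this page;
    # here it asks the engine not to recognize the '|' character.
    tesseract "$image" "$out" --dpi 300 -l eng+deu --psm 6 \
        -c tessedit_char_blacklist='|' hocr txt
fi
```

       With the hocr and txt configs selected, the results would be expected
       in demo.hocr and demo.txt.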

SINGLE OPTIONS

       -h, --help
           Show help message.

       --help-extra
           Show extra help for advanced users.

       --help-psm
           Show page segmentation modes.

       --help-oem
           Show OCR Engine modes.

       -v, --version
           Show the version of the tesseract(1) executable.

       --list-langs
           List available languages for tesseract engine. Can be used with
           --tessdata-dir PATH.

       --print-parameters
           Print tesseract parameters.
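
       The informational options can be combined with ordinary shell tools.
       The sketch below checks whether a language pack is listed. The helper
       has_lang and the directory /usr/share/tessdata are assumptions made
       for this example, as is the expectation that --list-langs prints one
       language code per line on standard output.

```shell
# Hypothetical helper: succeeds if the code given as $1 appears as a
# whole line on standard input.
has_lang() {
    grep -qxF -- "$1"
}

# Assumed tessdata location; adjust it to the local installation.
tessdata_dir=/usr/share/tessdata

if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
    if [ -d "$tessdata_dir" ] &&
        tesseract --tessdata-dir "$tessdata_dir" --list-langs | has_lang deu; then
        echo "deu is available in $tessdata_dir"
    fi
fi
```

       Matching whole lines avoids false positives from codes that share a
       prefix, such as aze and aze_cyrl.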

LANGUAGES AND SCRIPTS

       To recognize some text with Tesseract, it is normally necessary to
       specify the language(s) or script(s) of the text (unless it is English
       text which is supported by default) using -l LANG or -l SCRIPT.

       Selecting a language automatically also selects the language-specific
       character set and dictionary (word list).

       Selecting a script typically selects all characters of that script
       which can be from different languages. The dictionary which is included
       also contains a mix from different languages. In most cases, a script
       also supports English. So it is possible to recognize a language that
       has not been specifically trained for by using traineddata for the
       script it is written in.

       More than one language or script may be specified by using +. Example:
       tesseract myimage.png myimage -l eng+deu+fra.

       https://github.com/tesseract-ocr/tessdata_fast provides fast language
       and script models which are also part of Linux distributions.

       For Tesseract 4, tessdata_fast includes traineddata files for the
       following languages:

       afr (Afrikaans), amh (Amharic), ara (Arabic), asm (Assamese), aze
       (Azerbaijani), aze_cyrl (Azerbaijani - Cyrillic), bel (Belarusian), ben
       (Bengali), bod (Tibetan), bos (Bosnian), bre (Breton), bul (Bulgarian),
       cat (Catalan; Valencian), ceb (Cebuano), ces (Czech), chi_sim (Chinese
       simplified), chi_tra (Chinese traditional), chr (Cherokee), cos
       (Corsican), cym (Welsh), dan (Danish), deu (German), div (Dhivehi), dzo
       (Dzongkha), ell (Greek, Modern, 1453-), eng (English), enm (English,
       Middle, 1100-1500), epo (Esperanto), equ (Math / equation detection
       module), est (Estonian), eus (Basque), fas (Persian), fao (Faroese),
       fil (Filipino), fin (Finnish), fra (French), frk (Frankish), frm
       (French, Middle, ca.1400-1600), fry (West Frisian), gla (Scottish
       Gaelic), gle (Irish), glg (Galician), grc (Greek, Ancient, to 1453),
       guj (Gujarati), hat (Haitian; Haitian Creole), heb (Hebrew), hin
       (Hindi), hrv (Croatian), hun (Hungarian), hye (Armenian), iku
       (Inuktitut), ind (Indonesian), isl (Icelandic), ita (Italian), ita_old
       (Italian - Old), jav (Javanese), jpn (Japanese), kan (Kannada), kat
       (Georgian), kat_old (Georgian - Old), kaz (Kazakh), khm (Central
       Khmer), kir (Kirghiz; Kyrgyz), kmr (Kurdish Kurmanji), kor (Korean),
       kor_vert (Korean vertical), lao (Lao), lat (Latin), lav (Latvian), lit
       (Lithuanian), ltz (Luxembourgish), mal (Malayalam), mar (Marathi), mkd
       (Macedonian), mlt (Maltese), mon (Mongolian), mri (Maori), msa (Malay),
       mya (Burmese), nep (Nepali), nld (Dutch; Flemish), nor (Norwegian), oci
       (Occitan post 1500), ori (Oriya), osd (Orientation and script detection
       module), pan (Panjabi; Punjabi), pol (Polish), por (Portuguese), pus
       (Pushto; Pashto), que (Quechua), ron (Romanian; Moldavian; Moldovan),
       rus (Russian), san (Sanskrit), sin (Sinhala; Sinhalese), slk (Slovak),
       slv (Slovenian), snd (Sindhi), spa (Spanish; Castilian), spa_old
       (Spanish; Castilian - Old), sqi (Albanian), srp (Serbian), srp_latn
       (Serbian - Latin), sun (Sundanese), swa (Swahili), swe (Swedish), syr
       (Syriac), tam (Tamil), tat (Tatar), tel (Telugu), tgk (Tajik), tha
       (Thai), tir (Tigrinya), ton (Tonga), tur (Turkish), uig (Uighur;
       Uyghur), ukr (Ukrainian), urd (Urdu), uzb (Uzbek), uzb_cyrl (Uzbek -
       Cyrillic), vie (Vietnamese), yid (Yiddish), yor (Yoruba)

       To use a non-standard language pack named foo.traineddata, set the
       TESSDATA_PREFIX environment variable so the file can be found at
       TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the
       argument -l foo.

       For Tesseract 4, tessdata_fast includes traineddata files for the
       following scripts:

       Arabic, Armenian, Bengali, Canadian_Aboriginal, Cherokee, Cyrillic,
       Devanagari, Ethiopic, Fraktur, Georgian, Greek, Gujarati, Gurmukhi,
       HanS (Han simplified), HanS_vert (Han simplified, vertical), HanT (Han
       traditional), HanT_vert (Han traditional, vertical), Hangul,
       Hangul_vert (Hangul vertical), Hebrew, Japanese, Japanese_vert
       (Japanese vertical), Kannada, Khmer, Lao, Latin, Malayalam, Myanmar,
       Oriya (Odia), Sinhala, Syriac, Tamil, Telugu, Thaana, Thai, Tibetan,
       Vietnamese.

       The same languages and scripts are available from
       https://github.com/tesseract-ocr/tessdata_best. tessdata_best provides
       slow language and script models. These models are needed for training.
       They also can give better OCR results, but the recognition takes much
       more time.

       Both tessdata_fast and tessdata_best only support the LSTM OCR engine.

       There is a third repository, https://github.com/tesseract-ocr/tessdata,
       with models which support both the Tesseract 3 legacy OCR engine and
       the Tesseract 4 LSTM OCR engine.
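
       A non-standard language pack such as the foo.traineddata mentioned
       earlier in this section can also be selected with --tessdata-dir,
       which ENVIRONMENT VARIABLES names as the recommended alternative to
       TESSDATA_PREFIX. In this sketch the directory $HOME/my-tessdata and
       the image scan.png are assumptions; the directory is taken to be the
       one that directly contains foo.traineddata.

```shell
# Assumed directory holding foo.traineddata (hypothetical language pack).
model_dir="$HOME/my-tessdata"

if command -v tesseract >/dev/null 2>&1 &&
    [ -f "$model_dir/foo.traineddata" ] && [ -f scan.png ]; then
    # Writes scan.txt using the foo model from the assumed directory.
    tesseract scan.png scan --tessdata-dir "$model_dir" -l foo
fi
```

       Passing the directory on the command line keeps the choice of models
       local to one invocation instead of changing the environment of the
       whole shell session.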

CONFIG FILES AND AUGMENTING WITH USER DATA

       Tesseract config files consist of lines with parameter-value pairs
       (space separated). The parameters are documented as flags in the source
       code like the following one in tesseractclass.h:

           STRING_VAR_H(tessedit_char_blacklist, "", "Blacklist of chars not to recognize");

       These parameters may enable or disable various features of the engine,
       and may cause it to load (or not load) various data. For instance,
       let’s suppose you want to OCR in English, but suppress the normal
       dictionary and load an alternative word list and an alternative list of
       patterns; these two files are the most commonly used extra data files.

       If your language pack is in /path/to/eng.traineddata and the hocr
       config is in /path/to/configs/hocr then create three new files:

       /path/to/eng.user-words:

           the
           quick
           brown
           fox
           jumped

       /path/to/eng.user-patterns:

           1-\d\d\d-GOOG-411
           www.\n\\\*.com

       /path/to/configs/bazaar:

           load_system_dawg     F
           load_freq_dawg       F
           user_words_suffix    user-words
           user_patterns_suffix user-patterns

       Now, if you pass the word bazaar as a CONFIGFILE to Tesseract,
       Tesseract will load neither the system dictionary nor the dictionary
       of frequent words, and will load and use the eng.user-words and
       eng.user-patterns files you provided. The former is a simple word
       list, one word per line. The format of the latter is documented in
       dict/trie.h on read_pattern_list().
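
       The three files above can be created from the shell as sketched
       below. A temporary directory stands in for /path/to, and image.png
       and out are placeholder names. The final command only runs when
       tesseract, an eng.traineddata in that directory, and the image are
       all present.

```shell
# Stand-in for /path/to; a real setup would use the directory that
# holds eng.traineddata.
tessdata=$(mktemp -d)
mkdir -p "$tessdata/configs"

# Word list, one word per line.
cat > "$tessdata/eng.user-words" <<'EOF'
the
quick
brown
fox
jumped
EOF

# Pattern list; the quoted heredoc keeps the backslashes unchanged.
cat > "$tessdata/eng.user-patterns" <<'EOF'
1-\d\d\d-GOOG-411
www.\n\\\*.com
EOF

# Config file: one parameter and its value per line, space separated.
cat > "$tessdata/configs/bazaar" <<'EOF'
load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns
EOF

if command -v tesseract >/dev/null 2>&1 &&
    [ -f "$tessdata/eng.traineddata" ] && [ -f image.png ]; then
    tesseract image.png out --tessdata-dir "$tessdata" -l eng bazaar
fi
```

       The quoted 'EOF' delimiter matters for the patterns file: with an
       unquoted delimiter the shell would process some of the backslash
       sequences before the file is written.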

ENVIRONMENT VARIABLES

       TESSDATA_PREFIX
           If the TESSDATA_PREFIX is set to a path, then that path is used to
           find the tessdata directory with language and script recognition
           models and config files. Using --tessdata-dir PATH is the
           recommended alternative.

       OMP_THREAD_LIMIT
           If the tesseract executable was built with multithreading support,
           it will normally use four CPU cores for the OCR process. While this
           can be faster for a single image, it gives bad performance if the
           host computer provides fewer than four CPU cores or if OCR is run
           on many images. Only a single CPU core is used with
           OMP_THREAD_LIMIT=1.
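
       For batch jobs, the thread limit can be set once for the whole
       script. The sketch below assumes a directory named scans containing
       PNG files; both the directory name and the file type are placeholders.

```shell
# Restrict each tesseract process to a single CPU core.
export OMP_THREAD_LIMIT=1

if command -v tesseract >/dev/null 2>&1 && [ -d scans ]; then
    for image in scans/*.png; do
        # Skip the unexpanded pattern when no PNG file matches.
        [ -f "$image" ] || continue
        # scans/page.png produces scans/page.txt.
        tesseract "$image" "${image%.png}"
    done
fi
```

       With one core per process, several such loops or a parallel job
       runner can share the machine without the processes competing for the
       same cores.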

HISTORY

       The engine was developed at Hewlett Packard Laboratories Bristol and at
       Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
       more changes made in 1996 to port to Windows, and some C++izing in
       1998. A lot of the code was written in C, and then some more was
       written in C++. The C++ code makes heavy use of a list system using
       macros. This predates STL, was portable before STL, and is more
       efficient than STL lists, but has the big negative that if you do get a
       segmentation violation, it is hard to debug.

       Version 2.00 brought Unicode (UTF-8) support, six languages, and the
       ability to train Tesseract.

       Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy.
       See https://github.com/tesseract-ocr/docs/blob/main/AT-1995.pdf. Since
       Tesseract 2.00, scripts are now included to allow anyone to reproduce
       some of these tests. See
       https://tesseract-ocr.github.io/tessdoc/TestingTesseract.html for more
       details.

       Tesseract 3.00 added a number of new languages, including Chinese,
       Japanese, and Korean. It also introduced a new, single-file based
       system of managing language data.

       Tesseract 3.02 added BiDirectional text support, the ability to
       recognize multiple languages in a single image, and improved layout
       analysis.

       Tesseract 4 adds a new neural net (LSTM) based OCR engine which is
       focused on line recognition, but also still supports the legacy
       Tesseract OCR engine of Tesseract 3 which works by recognizing
       character patterns. Compatibility with Tesseract 3 is enabled by --oem
       0. This also needs traineddata files which support the legacy engine,
       for example those from the tessdata repository
       (https://github.com/tesseract-ocr/tessdata).

       For further details, see the release notes in the Tesseract
       documentation
       (https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html).

RESOURCES

       Main web site: https://github.com/tesseract-ocr

       User forum: https://groups.google.com/g/tesseract-ocr

       Documentation: https://tesseract-ocr.github.io/

       Information on training:
       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

SEE ALSO

       ambiguous_words(1), cntraining(1), combine_tessdata(1),
       dawg2wordlist(1), shape_training(1), mftraining(1), unicharambigs(5),
       unicharset(5), unicharset_extractor(1), wordlist2dawg(1)

AUTHOR

       Tesseract development was led at Hewlett-Packard and Google by Ray
       Smith. The development team has included:

       Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David
       Eger, Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern
       Wanke, Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil
       Cheatle, Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith
       Unnikrishnan, Raquel Romano, Ray Smith, Rika Antonova, Robert Moss,
       Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.

       For a list of contributors see
       https://github.com/tesseract-ocr/tesseract/blob/main/AUTHORS.

COPYING

       Licensed under the Apache License, Version 2.0

                                  09/23/2022                      TESSERACT(1)