tesseract(1)

TESSERACT(1)                                                      TESSERACT(1)

NAME

       tesseract - command-line OCR engine

SYNOPSIS

       tesseract imagename|stdin outputbase|stdout [options...]
       [configfile...]

DESCRIPTION

       tesseract(1) is a commercial-quality OCR engine originally developed at
       HP between 1985 and 1995. In 1995, this engine was among the top 3
       evaluated by UNLV. It was open-sourced by HP and UNLV in 2005, and has
       been developed at Google since then.

IN/OUT ARGUMENTS

       imagename
           The name of the input image. Most image file formats (anything
           readable by Leptonica) are supported.

       stdin
           Instruction to read data from standard input.

       outputbase
           The basename of the output file (to which the appropriate extension
           will be appended). By default the output will be named
           outputbase.txt.

       stdout
           Instruction to send output data to standard output.
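
       The stdin and stdout forms above can be driven from a script. The
       following is a minimal Python sketch, assuming a tesseract executable
       is on PATH and that the recognized text is UTF-8. The helper names
       stream_command and ocr_bytes, and the placeholder file names, are
       hypothetical and not part of Tesseract.

```python
import subprocess


def stream_command(read_stdin=True, write_stdout=True,
                   imagename="page.png", outputbase="page"):
    """Build the argument list; the file names are placeholders."""
    source = "stdin" if read_stdin else imagename
    target = "stdout" if write_stdout else outputbase
    return ["tesseract", source, target]


def ocr_bytes(image_bytes):
    """Pipe image data through tesseract and return the recognized text.

    Assumption: tesseract is installed, on PATH, and emits UTF-8 text.
    """
    result = subprocess.run(stream_command(), input=image_bytes,
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8")
```

       Only ocr_bytes actually starts the program; stream_command can be
       inspected without Tesseract installed.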

OPTIONS

       --tessdata-dir /path
           Specify the location of the tessdata directory.

       --user-words /path/to/file
           Specify the location of the user words file.

       --user-patterns /path/to/file
           Specify the location of the user patterns file.

       -c configvar=value
           Set value for control parameter. Multiple -c arguments are allowed.

       -l lang
           The language to use. If none is specified, English is assumed.
           Multiple languages may be specified, separated by plus characters.
           Tesseract uses 3-character ISO 639-2 language codes. (See
           LANGUAGES)

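       As an illustration of the -c and -l forms, here is a small Python
       sketch that assembles those arguments. The function names language_args
       and param_args are hypothetical helpers, not part of Tesseract.

```python
def language_args(langs):
    """Return ['-l', 'eng+deu'] style arguments for the given codes."""
    if not langs:
        return []  # no -l given: tesseract then assumes English
    return ["-l", "+".join(langs)]


def param_args(params):
    """Return one '-c', 'name=value' pair per control parameter."""
    args = []
    for name, value in params.items():
        args += ["-c", f"{name}={value}"]
    return args
```

       For example, language_args(["eng", "deu"]) yields the two arguments
       -l and eng+deu.
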
       --psm N
           Set Tesseract to only run a subset of layout analysis and assume a
           certain form of image. The options for N are:

               0 = Orientation and script detection (OSD) only.
               1 = Automatic page segmentation with OSD.
               2 = Automatic page segmentation, but no OSD, or OCR.
               3 = Fully automatic page segmentation, but no OSD. (Default)
               4 = Assume a single column of text of variable sizes.
               5 = Assume a single uniform block of vertically aligned text.
               6 = Assume a single uniform block of text.
               7 = Treat the image as a single text line.
               8 = Treat the image as a single word.
               9 = Treat the image as a single word in a circle.
               10 = Treat the image as a single character.

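       A script can keep the table above as named constants and reject values
       outside 0-10 before calling the program. This is a sketch only: the
       enum name PageSegMode, its member names, and psm_args are illustrative
       choices made here, not identifiers taken from this manual.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Values of N for --psm (member names are illustrative)."""
    OSD_ONLY = 0
    AUTO_OSD = 1
    AUTO_ONLY = 2        # segmentation only: no OSD, no OCR
    AUTO = 3             # the default
    SINGLE_COLUMN = 4
    SINGLE_BLOCK_VERT_TEXT = 5
    SINGLE_BLOCK = 6
    SINGLE_LINE = 7
    SINGLE_WORD = 8
    CIRCLE_WORD = 9
    SINGLE_CHAR = 10


def psm_args(mode=PageSegMode.AUTO):
    """Return ['--psm', 'N']; raises ValueError for an unknown mode."""
    return ["--psm", str(int(PageSegMode(mode)))]
```
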
       configfile
           The name of a config to use. A config is a plaintext file which
           contains a list of variables and their values, one per line, with a
           space separating variable from value. Interesting config files
           include:

           ·   hocr - Output in hOCR format instead of as a text file.

           ·   pdf - Output in PDF format instead of as a text file.

       Nota Bene: The options -l lang and --psm N must occur before any
       configfile.
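
       The ordering rule in the Nota Bene can be enforced by building the
       argument list in one place. The sketch below is one possible way to do
       that in Python; build_command is a hypothetical helper and the file
       names in the usage note are placeholders.

```python
def build_command(imagename, outputbase, lang=None, psm=None, configfiles=()):
    """Assemble a tesseract argument list with options before configfiles."""
    cmd = ["tesseract", imagename, outputbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    # Config names always go last, after -l and --psm.
    cmd += list(configfiles)
    return cmd
```

       With this helper, build_command("scan.png", "scan", lang="eng", psm=6,
       configfiles=["hocr"]) places hocr after -l eng and --psm 6.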

SINGLE OPTIONS

       -v
           Returns the current version of the tesseract(1) executable.

       --list-langs
           List available languages for the tesseract engine. Can be used with
           --tessdata-dir.

       --print-parameters
           Print tesseract parameters to stdout.

LANGUAGES

       There are currently language packs available for the following
       languages (in https://github.com/tesseract-ocr/tessdata):

       afr (Afrikaans) amh (Amharic) ara (Arabic) asm (Assamese) aze
       (Azerbaijani) aze_cyrl (Azerbaijani - Cyrillic) bel (Belarusian) ben
       (Bengali) bod (Tibetan) bos (Bosnian) bul (Bulgarian) cat (Catalan;
       Valencian) ceb (Cebuano) ces (Czech) chi_sim (Chinese - Simplified)
       chi_tra (Chinese - Traditional) chr (Cherokee) cym (Welsh) dan (Danish)
       dan_frak (Danish - Fraktur) deu (German) deu_frak (German - Fraktur)
       dzo (Dzongkha) ell (Greek, Modern (1453-)) eng (English) enm (English,
       Middle (1100-1500)) epo (Esperanto) equ (Math / equation detection
       module) est (Estonian) eus (Basque) fas (Persian) fin (Finnish) fra
       (French) frk (Frankish) frm (French, Middle (ca.1400-1600)) gle (Irish)
       glg (Galician) grc (Greek, Ancient (to 1453)) guj (Gujarati) hat
       (Haitian; Haitian Creole) heb (Hebrew) hin (Hindi) hrv (Croatian) hun
       (Hungarian) iku (Inuktitut) ind (Indonesian) isl (Icelandic) ita
       (Italian) ita_old (Italian - Old) jav (Javanese) jpn (Japanese) kan
       (Kannada) kat (Georgian) kat_old (Georgian - Old) kaz (Kazakh) khm
       (Central Khmer) kir (Kirghiz; Kyrgyz) kor (Korean) kur (Kurdish) lao
       (Lao) lat (Latin) lav (Latvian) lit (Lithuanian) mal (Malayalam) mar
       (Marathi) mkd (Macedonian) mlt (Maltese) msa (Malay) mya (Burmese) nep
       (Nepali) nld (Dutch; Flemish) nor (Norwegian) ori (Oriya) osd
       (Orientation and script detection module) pan (Panjabi; Punjabi) pol
       (Polish) por (Portuguese) pus (Pushto; Pashto) ron (Romanian;
       Moldavian; Moldovan) rus (Russian) san (Sanskrit) sin (Sinhala;
       Sinhalese) slk (Slovak) slk_frak (Slovak - Fraktur) slv (Slovenian) spa
       (Spanish; Castilian) spa_old (Spanish; Castilian - Old) sqi (Albanian)
       srp (Serbian) srp_latn (Serbian - Latin) swa (Swahili) swe (Swedish)
       syr (Syriac) tam (Tamil) tel (Telugu) tgk (Tajik) tgl (Tagalog) tha
       (Thai) tir (Tigrinya) tur (Turkish) uig (Uighur; Uyghur) ukr
       (Ukrainian) urd (Urdu) uzb (Uzbek) uzb_cyrl (Uzbek - Cyrillic) vie
       (Vietnamese) yid (Yiddish)

       To use a non-standard language pack named foo.traineddata, set the
       TESSDATA_PREFIX environment variable so the file can be found at
       TESSDATA_PREFIX/tessdata/foo.traineddata and give Tesseract the
       argument -l foo.
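
       The TESSDATA_PREFIX lookup described above can be checked from a script
       before running the program. The following Python sketch computes the
       expected file location and prepares an environment for a child process.
       The helper names traineddata_path and env_with_prefix, and the prefix
       /opt/ocr used in the usage note, are assumptions made for illustration.

```python
import os
from pathlib import Path


def traineddata_path(prefix, lang):
    """Where the language pack must sit for a given TESSDATA_PREFIX."""
    return Path(prefix) / "tessdata" / f"{lang}.traineddata"


def env_with_prefix(prefix):
    """Copy the current environment and set TESSDATA_PREFIX in the copy."""
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = str(prefix)
    return env
```

       For a prefix of /opt/ocr and the language foo, traineddata_path points
       at /opt/ocr/tessdata/foo.traineddata; the result of env_with_prefix can
       be passed as the env argument of subprocess.run.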

CONFIG FILES AND AUGMENTING WITH USER DATA

       Tesseract config files consist of lines with variable-value pairs
       (space separated). The variables are documented as flags in the source
       code like the following one in tesseractclass.h:

       STRING_VAR_H(tessedit_char_blacklist, "", "Blacklist of chars not to
       recognize");

       These variables may enable or disable various features of the engine,
       and may cause it to load (or not load) various data. For instance,
       let’s suppose you want to OCR in English, but suppress the normal
       dictionary and load an alternative word list and an alternative list of
       patterns. These two files are the most commonly used extra data files.

       If your language pack is in /path/to/eng.traineddata and the hocr
       config is in /path/to/configs/hocr then create three new files:

       /path/to/eng.user-words:

           the
           quick
           brown
           fox
           jumped

       /path/to/eng.user-patterns:

           1-\d\d\d-GOOG-411
           www.\n\\\*.com

       /path/to/configs/bazaar:

           load_system_dawg     F
           load_freq_dawg       F
           user_words_suffix    user-words
           user_patterns_suffix user-patterns

       Now, if you pass the word bazaar as a trailing command line parameter
       to Tesseract, Tesseract will not bother loading the system dictionary
       nor the dictionary of frequent words and will load and use the
       eng.user-words and eng.user-patterns files you provided. The former is
       a simple word list, one per line. The format of the latter is
       documented in dict/trie.h on read_pattern_list().
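
       The config and word-list formats described above are simple enough to
       generate from a script. The Python sketch below writes a config with
       one space-separated pair per line, reads it back, and writes a word
       list with one word per line. The function names and the BAZAAR constant
       are hypothetical; the variable names and values are the ones from the
       bazaar example.

```python
import tempfile
from pathlib import Path

# Contents of the example bazaar config shown above.
BAZAAR = {
    "load_system_dawg": "F",
    "load_freq_dawg": "F",
    "user_words_suffix": "user-words",
    "user_patterns_suffix": "user-patterns",
}


def write_config(path, variables):
    """Write one 'variable value' pair per line, separated by a space."""
    lines = [f"{name} {value}" for name, value in variables.items()]
    Path(path).write_text("\n".join(lines) + "\n")


def read_config(path):
    """Parse a config file back into a dict, skipping blank lines."""
    pairs = {}
    for line in Path(path).read_text().splitlines():
        if line.strip():
            name, value = line.split(None, 1)
            pairs[name] = value.strip()
    return pairs


def write_word_list(path, words):
    """Write a user words file: one word per line."""
    Path(path).write_text("\n".join(words) + "\n")
```

       Writing into a directory created with tempfile.TemporaryDirectory() is
       a convenient way to try this without touching an installed tessdata
       tree.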

HISTORY

       The engine was developed at Hewlett Packard Laboratories Bristol and at
       Hewlett Packard Co, Greeley, Colorado between 1985 and 1994, with some
       more changes made in 1996 to port to Windows, and some C++izing in
       1998. A lot of the code was written in C, and then some more was
       written in C++. The C++ code makes heavy use of a list system using
       macros. This predates the STL, was portable before the STL, and is more
       efficient than STL lists, but has the big negative that if you do get
       a segmentation violation, it is hard to debug.

       Version 2.00 brought Unicode (UTF-8) support, six languages, and the
       ability to train Tesseract.

       Tesseract was included in UNLV’s Fourth Annual Test of OCR Accuracy.
       See https://github.com/tesseract-ocr/docs/blob/master/AT-1995.pdf. With
       Tesseract 2.00, scripts are now included to allow anyone to reproduce
       some of these tests. See
       https://github.com/tesseract-ocr/tesseract/wiki/TestingTesseract for
       more details.

       Tesseract 3.00 adds a number of new languages, including Chinese,
       Japanese, and Korean. It also introduces a new, single-file based
       system of managing language data.

       Tesseract 3.02 adds BiDirectional text support, the ability to
       recognize multiple languages in a single image, and improved layout
       analysis.

       For further details, see the file ReleaseNotes included with the
       distribution.

RESOURCES

       Main web site: https://github.com/tesseract-ocr
       Information on training:
       https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

AUTHOR

       Tesseract development was led at Hewlett-Packard and Google by Ray
       Smith. The development team has included:

       Ahmad Abdulkader, Chris Newton, Dan Johnson, Dar-Shyang Lee, David
       Eger, Eric Wiseblatt, Faisal Shafait, Hiroshi Takenaka, Joe Liu, Joern
       Wanke, Mark Seaman, Mickey Namiki, Nicholas Beato, Oded Fuhrmann, Phil
       Cheatle, Pingping Xiu, Pong Eksombatchai (Chantat), Ranjith
       Unnikrishnan, Raquel Romano, Ray Smith, Rika Antonova, Robert Moss,
       Samuel Charron, Sheelagh Lloyd, Shobhit Saxena, and Thomas Kielbus.

COPYING

       Licensed under the Apache License, Version 2.0

                                  06/28/2015                      TESSERACT(1)