COMBINE_TESSDATA(1)                                        COMBINE_TESSDATA(1)

NAME

       combine_tessdata - combine/extract/overwrite/list/compact Tesseract
       data

SYNOPSIS

       combine_tessdata [OPTION] FILE...

DESCRIPTION

       combine_tessdata(1) is the main program to
       combine/extract/overwrite/list/compact tessdata components in
       [lang].traineddata files.

       To combine all the individual tessdata components (unicharset, DAWGs,
       classifier templates, ambiguities, language configs) located at, say,
       /home/$USER/temp/eng.* run:

           combine_tessdata /home/$USER/temp/eng.

       The result will be a combined tessdata file
       /home/$USER/temp/eng.traineddata
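
       Before combining, it can help to confirm that component files actually
       exist under the prefix. The following is a minimal sketch for a POSIX
       shell, not part of combine_tessdata itself; the helper names
       count_components and combine_if_present are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; they are not shipped with Tesseract.

# Print how many regular files start with the given prefix,
# ignoring an already combined <prefix>traineddata file.
count_components() {
    n=0
    for f in "$1"*; do
        case "$f" in
            *.traineddata) continue ;;
        esac
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    printf '%s\n' "$n"
}

# Run combine_tessdata on the prefix only when components are present.
combine_if_present() {
    if [ "$(count_components "$1")" -eq 0 ]; then
        echo "no component files found under $1" >&2
        return 1
    fi
    combine_tessdata "$1"
}
```

       For example, combine_if_present /home/$USER/temp/eng. would invoke the
       tool only when at least one eng.* component file exists.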

       Specify option -e if you would like to extract individual components
       from a combined traineddata file. For example, to extract the language
       config file and the unicharset from tessdata/eng.traineddata run:

           combine_tessdata -e tessdata/eng.traineddata \
             /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

       The desired config file and unicharset will be written to
       /home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset

       Specify option -o to overwrite individual components of the given
       [lang].traineddata file. For example, to overwrite the language config
       and unichar ambiguities files in tessdata/eng.traineddata use:

           combine_tessdata -o tessdata/eng.traineddata \
             /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

       As a result, tessdata/eng.traineddata will contain the new language
       config and unichar ambigs, plus all the original DAWGs, classifier
       templates, etc.
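
       Because -o changes the contents of the given traineddata file, keeping
       a copy first may be prudent. A minimal sketch under that assumption
       follows; the helper names and the .bak suffix are illustrative choices,
       not something combine_tessdata provides.

```shell
#!/bin/sh
# Hypothetical helpers; the ".bak" suffix is an arbitrary choice.

# Copy the traineddata file next to itself, preserving mode and timestamps.
backup_traineddata() {
    cp -p "$1" "$1.bak"
}

# Back up the traineddata file, then overwrite the listed components.
# Usage: overwrite_components TRAINEDDATA COMPONENT_FILE...
overwrite_components() {
    traineddata=$1
    shift
    backup_traineddata "$traineddata" || return 1
    combine_tessdata -o "$traineddata" "$@"
}
```

       A call such as overwrite_components tessdata/eng.traineddata
       /home/$USER/temp/eng.config would leave the previous file available as
       tessdata/eng.traineddata.bak.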

       Note: the file names of the files to extract to and to overwrite from
       should have the appropriate file suffixes (extensions) indicating their
       tessdata component type (.unicharset for the unicharset, .unicharambigs
       for unichar ambigs, etc). See the k*FileSuffix variables in
       ccutil/tessdatamanager.h.
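
       To see which suffix a given path carries, something like the following
       sketch could be used. It assumes the component type is taken from the
       text after the final period of the file name; component_suffix is a
       hypothetical helper, and the authoritative list of suffixes remains the
       one in ccutil/tessdatamanager.h.

```shell
#!/bin/sh
# Hypothetical helper: print the text after the last period of a file name.
component_suffix() {
    name=${1##*/}                 # drop any directory part
    case "$name" in
        *.*) printf '%s\n' "${name##*.}" ;;
        *)   return 1 ;;          # no period, so no suffix to report
    esac
}
```

       For instance, component_suffix /home/$USER/temp/eng.unicharambigs
       prints unicharambigs, while a name without a period makes the helper
       return a non-zero status.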

       Specify option -u to unpack all the components to the specified path:

           combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

       This will create /home/$USER/temp/eng.* files with individual tessdata
       components from tessdata/eng.traineddata.
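
       Since the unpack prefix is expected to end with a period, a small
       wrapper can add it when it is missing. This is only a sketch; the names
       with_trailing_period and unpack_traineddata are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; they are not shipped with Tesseract.

# Print the prefix unchanged if it ends with a period, else append one.
with_trailing_period() {
    case "$1" in
        *.) printf '%s\n' "$1" ;;
        *)  printf '%s.\n' "$1" ;;
    esac
}

# Unpack every component of a traineddata file under the normalized prefix.
# Usage: unpack_traineddata TRAINEDDATA PATHPREFIX
unpack_traineddata() {
    combine_tessdata -u "$1" "$(with_trailing_period "$2")"
}
```

       With this wrapper, unpack_traineddata tessdata/eng.traineddata
       /home/$USER/temp/eng passes /home/$USER/temp/eng. to the tool.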

OPTIONS

       -c .traineddata FILE...: Compacts the LSTM component in the
       .traineddata file to int.

       -d .traineddata FILE...: Lists the directory of components from the
       .traineddata file.

       -e .traineddata FILE...: Extracts the specified components from the
       .traineddata file.

       -l .traineddata FILE...: Lists the network information.

       -o .traineddata FILE...: Overwrites the specified components of the
       .traineddata file with those provided on the command line.

       -u .traineddata PATHPREFIX: Unpacks the .traineddata using the provided
       prefix.
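
       As an illustration of the -d option, the sketch below checks that the
       file exists and that the tool is on the PATH before listing the
       components. The function name show_directory and the chosen exit
       statuses are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: list the components of a traineddata file with -d.
show_directory() {
    if [ ! -f "$1" ]; then
        echo "no such traineddata file: $1" >&2
        return 1
    fi
    if ! command -v combine_tessdata >/dev/null 2>&1; then
        echo "combine_tessdata not found in PATH" >&2
        return 127
    fi
    combine_tessdata -d "$1"
}
```

       It would be called as show_directory tessdata/eng.traineddata.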

CAVEATS

       Prefix refers to the full file prefix, including the trailing period
       (.).

COMPONENTS

       The components in a Tesseract lang.traineddata file as of Tesseract 4.0
       are briefly described below. For more information on many of these
       files, see
       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html and
       https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

       lang.config
           (Optional) Language-specific overrides to default config variables.
           For 4.0 traineddata files, lang.config provides control parameters
           that can affect layout analysis and sub-languages.

       lang.unicharset
           (Required - 3.0x legacy tesseract) The list of symbols that
           Tesseract recognizes, with properties. See unicharset(5).

       lang.unicharambigs
           (Optional - 3.0x legacy tesseract) This file contains information
           on pairs of recognized symbols which are often confused. For
           example, rn and m.

       lang.inttemp
           (Required - 3.0x legacy tesseract) Character shape templates for
           each unichar. Produced by mftraining(1).

       lang.pffmtable
           (Required - 3.0x legacy tesseract) The number of features expected
           for each unichar. Produced by mftraining(1) from .tr files.

       lang.normproto
           (Required - 3.0x legacy tesseract) Character normalization
           prototypes generated by cntraining(1) from .tr files.

       lang.punc-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from punctuation
           patterns found around words. The "word" part is replaced by a
           single space.

       lang.word-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from dictionary
           words from the language.

       lang.number-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from tokens which
           originally contained digits. Each digit is replaced by a space
           character.

       lang.freq-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from the most
           frequent words which would have gone into word-dawg.

       lang.fixed-length-dawgs
           (Optional - 3.0x legacy tesseract) Several dawgs of different fixed
           lengths, useful for languages like Chinese.

       lang.shapetable
           (Optional - 3.0x legacy tesseract) When present, a shapetable is an
           extra layer between the character classifier and the word
           recognizer that allows the character classifier to return a
           collection of unichar ids and fonts instead of a single unichar-id
           and font.

       lang.bigram-dawg
           (Optional - 3.0x legacy tesseract) A dawg of word bigrams where the
           words are separated by a space and each digit is replaced by a ?.

       lang.unambig-dawg
           (Optional - 3.0x legacy tesseract) .

       lang.params-model
           (Optional - 3.0x legacy tesseract) .

       lang.lstm
           (Required - 4.0 LSTM) Neural net trained recognition model
           generated by lstmtraining.

       lang.lstm-punc-dawg
           (Optional - 4.0 LSTM) A dawg made from punctuation patterns found
           around words. The "word" part is replaced by a single space. Uses
           lang.lstm-unicharset.

       lang.lstm-word-dawg
           (Optional - 4.0 LSTM) A dawg made from dictionary words from the
           language. Uses lang.lstm-unicharset.

       lang.lstm-number-dawg
           (Optional - 4.0 LSTM) A dawg made from tokens which originally
           contained digits. Each digit is replaced by a space character. Uses
           lang.lstm-unicharset.

       lang.lstm-unicharset
           (Required - 4.0 LSTM) The Unicode character set that Tesseract
           recognizes, with properties. The same unicharset must be used to
           train the LSTM and build the lstm-*-dawgs files.

       lang.lstm-recoder
           (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps
           the unicharset further to the codes actually used by the neural
           network recognizer. This is created as part of the starter
           traineddata by combine_lang_model.

       lang.version
           (Optional) Version string for the traineddata file. First appeared
           in version 4.0 of Tesseract. Older traineddata files will report
           Version:Pre-4.0.0. Version 4.0 traineddata files may include the
           network spec used for LSTM training as part of the version string.

HISTORY

       combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO

       tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1),
       unicharset(5), unicharambigs(5)

COPYING

       Copyright (C) 2009, Google Inc. Licensed under the Apache License,
       Version 2.0.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).



                                  07/22/2023               COMBINE_TESSDATA(1)