COMBINE_TESSDATA(1)                                        COMBINE_TESSDATA(1)


NAME
       combine_tessdata - combine/extract/overwrite/list/compact Tesseract
       data

SYNOPSIS
       combine_tessdata [OPTION] FILE...

DESCRIPTION
       combine_tessdata(1) is the main program to
       combine/extract/overwrite/list/compact tessdata components in
       [lang].traineddata files.

       To combine all the individual tessdata components (unicharset, DAWGs,
       classifier templates, ambiguities, language configs) located at, say,
       /home/$USER/temp/eng.*, run:

           combine_tessdata /home/$USER/temp/eng.

       The result will be a combined tessdata file
       /home/$USER/temp/eng.traineddata.

       Specify option -e if you would like to extract individual components
       from a combined traineddata file. For example, to extract the language
       config file and the unicharset from tessdata/eng.traineddata, run:

           combine_tessdata -e tessdata/eng.traineddata \
               /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

       The desired config file and unicharset will be written to
       /home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset.

       Specify option -o to overwrite individual components of the given
       [lang].traineddata file. For example, to overwrite the language config
       and unichar ambiguities files in tessdata/eng.traineddata, use:

           combine_tessdata -o tessdata/eng.traineddata \
               /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

       As a result, tessdata/eng.traineddata will contain the new language
       config and unichar ambigs, plus all the original DAWGs, classifier
       templates, etc.

       Note: the file names of the files to extract to and to overwrite from
       should have the appropriate file suffixes (extensions) indicating
       their tessdata component type (.unicharset for the unicharset,
       .unicharambigs for unichar ambigs, etc.). See the k*FileSuffix
       variables in ccutil/tessdatamanager.h.

       Specify option -u to unpack all the components to the specified path:

           combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

       This will create /home/$USER/temp/eng.* files with individual tessdata
       components from tessdata/eng.traineddata.

OPTIONS
       -c .traineddata FILE...: Compacts the LSTM component in the
       .traineddata file to int (see the example invocations below).

       -d .traineddata FILE...: Lists the directory of components from the
       .traineddata file.

       -e .traineddata FILE...: Extracts the specified components from the
       .traineddata file.

       -l .traineddata FILE...: Lists the network information.

       -o .traineddata FILE...: Overwrites the specified components of the
       .traineddata file with those provided on the command line.

       -u .traineddata PATHPREFIX: Unpacks the .traineddata file using the
       provided prefix.

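       For instance, to compact the LSTM model to its integer form, list the
       directory of components, or print the network information of a
       traineddata file, invocations of the following shape can be used (the
       tessdata/eng.traineddata path is purely illustrative):

           combine_tessdata -c tessdata/eng.traineddata
           combine_tessdata -d tessdata/eng.traineddata
           combine_tessdata -l tessdata/eng.traineddata
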
CAVEATS
       Prefix refers to the full file prefix, including the period (.), as in
       /home/$USER/temp/eng. in the examples above.

COMPONENTS
       The components in a Tesseract lang.traineddata file as of Tesseract
       4.0 are briefly described below. For more information on many of these
       files, see
       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html and
       https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

       lang.config
           (Optional) Language-specific overrides to default config
           variables. For 4.0 traineddata files, lang.config provides control
           parameters which can affect layout analysis and sub-languages.

       lang.unicharset
           (Required - 3.0x legacy tesseract) The list of symbols that
           Tesseract recognizes, with properties. See unicharset(5).

       lang.unicharambigs
           (Optional - 3.0x legacy tesseract) This file contains information
           on pairs of recognized symbols which are often confused, for
           example "rn" and "m".

       lang.inttemp
           (Required - 3.0x legacy tesseract) Character shape templates for
           each unichar. Produced by mftraining(1).

       lang.pffmtable
           (Required - 3.0x legacy tesseract) The number of features expected
           for each unichar. Produced by mftraining(1) from .tr files.

       lang.normproto
           (Required - 3.0x legacy tesseract) Character normalization
           prototypes generated by cntraining(1) from .tr files.

       lang.punc-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from punctuation
           patterns found around words. The "word" part is replaced by a
           single space.

       lang.word-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from dictionary
           words from the language.

       lang.number-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from tokens which
           originally contained digits. Each digit is replaced by a space
           character.

       lang.freq-dawg
           (Optional - 3.0x legacy tesseract) A dawg made from the most
           frequent words which would have gone into word-dawg.

       lang.fixed-length-dawgs
           (Optional - 3.0x legacy tesseract) Several dawgs of different
           fixed lengths, useful for languages like Chinese.

       lang.shapetable
           (Optional - 3.0x legacy tesseract) When present, a shapetable is
           an extra layer between the character classifier and the word
           recognizer that allows the character classifier to return a
           collection of unichar ids and fonts instead of a single unichar-id
           and font.

       lang.bigram-dawg
           (Optional - 3.0x legacy tesseract) A dawg of word bigrams where
           the words are separated by a space and each digit is replaced by
           a ?.

       lang.unambig-dawg
           (Optional - 3.0x legacy tesseract)

       lang.params-model
           (Optional - 3.0x legacy tesseract)

       lang.lstm
           (Required - 4.0 LSTM) Trained neural-net recognition model
           generated by lstmtraining.

       lang.lstm-punc-dawg
           (Optional - 4.0 LSTM) A dawg made from punctuation patterns found
           around words. The "word" part is replaced by a single space. Uses
           lang.lstm-unicharset.

       lang.lstm-word-dawg
           (Optional - 4.0 LSTM) A dawg made from dictionary words from the
           language. Uses lang.lstm-unicharset.

       lang.lstm-number-dawg
           (Optional - 4.0 LSTM) A dawg made from tokens which originally
           contained digits. Each digit is replaced by a space character.
           Uses lang.lstm-unicharset.

       lang.lstm-unicharset
           (Required - 4.0 LSTM) The unicode character set that Tesseract
           recognizes, with properties. The same unicharset must be used to
           train the LSTM and to build the lstm-*-dawgs files.

       lang.lstm-recoder
           (Required - 4.0 LSTM) Unicharcompress, aka the recoder, which maps
           the unicharset further to the codes actually used by the neural
           network recognizer. This is created as part of the starter
           traineddata by combine_lang_model.

       lang.version
           (Optional) Version string for the traineddata file. First appeared
           in version 4.0 of Tesseract. Older traineddata files will report
           Version:Pre-4.0.0. 4.0 traineddata files may include the network
           spec used for LSTM training as part of the version string.

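       To check which of the components described above are present in a
       given traineddata file, and to pull an individual component out under
       its matching suffix, the -d and -e options can be used, for example
       (paths illustrative):

           combine_tessdata -d tessdata/eng.traineddata
           combine_tessdata -e tessdata/eng.traineddata \
               /home/$USER/temp/eng.lstm
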
HISTORY
       combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO
       tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1),
       unicharset(5), unicharambigs(5)

COPYING
       Copyright (C) 2009, Google Inc. Licensed under the Apache License,
       Version 2.0

AUTHOR
       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).



                                  07/22/2023              COMBINE_TESSDATA(1)