COMBINE_TESSDATA(1)                                        COMBINE_TESSDATA(1)

NAME

       combine_tessdata - combine/extract/overwrite Tesseract data

SYNOPSIS

       combine_tessdata [OPTION] FILE...

DESCRIPTION

       combine_tessdata(1) is the main program for combining, extracting, and
       overwriting tessdata components in [lang].traineddata files.

       To combine all the individual tessdata components (unicharset, DAWGs,
       classifier templates, ambiguities, language configs) located at, say,
       /home/$USER/temp/eng.* run:

           combine_tessdata /home/$USER/temp/eng.

       The result will be a combined tessdata file,
       /home/$USER/temp/eng.traineddata.

       Specify option -e to extract individual components from a combined
       traineddata file. For example, to extract the language config file
       and the unicharset from tessdata/eng.traineddata, run:

           combine_tessdata -e tessdata/eng.traineddata \
             /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

       The config file and unicharset will be written to
       /home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset.

       Specify option -o to overwrite individual components of the given
       [lang].traineddata file. For example, to overwrite the language config
       and unichar ambiguities files in tessdata/eng.traineddata, run:

           combine_tessdata -o tessdata/eng.traineddata \
             /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

       As a result, tessdata/eng.traineddata will contain the new language
       config and unichar ambigs, plus all the original DAWGs, classifier
       templates, etc.

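       Because -o changes the given .traineddata file, it can be worth
       keeping a copy first. This is a minimal sketch of that precaution, not
       a feature of combine_tessdata: overwrite_with_backup is a hypothetical
       wrapper, and the .bak suffix is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: overwrite_with_backup is a hypothetical wrapper, not part
# of Tesseract. It copies the .traineddata file to <file>.bak before
# calling combine_tessdata -o, so the original can be restored later.
overwrite_with_backup() {
    traineddata=$1
    shift
    # Stop if the backup cannot be written.
    cp -p "$traineddata" "$traineddata.bak" || return 1
    if command -v combine_tessdata >/dev/null 2>&1; then
        combine_tessdata -o "$traineddata" "$@"
    else
        echo "would run: combine_tessdata -o $traineddata $*"
    fi
}
```

       For example, overwrite_with_backup tessdata/eng.traineddata
       "/home/$USER/temp/eng.config" leaves the previous contents in
       tessdata/eng.traineddata.bak.
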
       Note: the names of the files to extract to and to overwrite from
       should carry the file suffix (extension) that indicates their tessdata
       component type (.unicharset for the unicharset, .unicharambigs for
       unichar ambigs, etc.). See the k*FileSuffix variables in
       ccutil/tessdatamanager.h.

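       File names can be checked against the expected suffixes before they
       are passed to -e or -o. This is a minimal sketch: component_type is a
       hypothetical helper, and its suffix list is copied from the COMPONENTS
       section of this page, so it is an assumption about any other Tesseract
       version. The k*FileSuffix variables remain the authoritative list.

```shell
#!/bin/sh
# Sketch only: component_type is a hypothetical helper, not part of
# Tesseract. The suffix list mirrors the COMPONENTS section of this page.
KNOWN_SUFFIXES="config unicharset unicharambigs inttemp pffmtable normproto
punc-dawg word-dawg number-dawg freq-dawg fixed-length-dawgs cube-unicharset
cube-word-dawg shapetable bigram-dawg unambig-dawg params-training-model"

# Print the component suffix of a file name, or fail if it is not one of
# the known suffixes (or the name has no suffix at all).
component_type() {
    case $1 in
        *.*) ;;
        *)   echo "no suffix in file name: $1" >&2
             return 1 ;;
    esac
    suffix=${1##*.}
    for known in $KNOWN_SUFFIXES; do
        if [ "$suffix" = "$known" ]; then
            printf '%s\n' "$suffix"
            return 0
        fi
    done
    echo "unrecognized component suffix: $1" >&2
    return 1
}
```

       For example, component_type "/home/$USER/temp/eng.unicharambigs"
       succeeds, while a name such as eng.txt is rejected.
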
       Specify option -u to unpack all the components to the specified path
       prefix:

           combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

       This will create /home/$USER/temp/eng.* files with the individual
       tessdata components from tessdata/eng.traineddata.

OPTIONS

       -e .traineddata FILE...
           Extracts the specified components from the .traineddata file.

       -o .traineddata FILE...
           Overwrites the specified components of the .traineddata file with
           those provided on the command line.

       -u .traineddata PATHPREFIX
           Unpacks the .traineddata file using the provided prefix.

CAVEATS

       Prefix refers to the full file prefix, including the trailing period
       (.): for example, /home/$USER/temp/eng. and not /home/$USER/temp/eng

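       A script can guard against a missing period before unpacking. This is
       a minimal sketch: with_trailing_period and unpack_to_prefix are
       hypothetical helper names, and only the -u option documented here is
       used.

```shell
#!/bin/sh
# Sketch only: with_trailing_period and unpack_to_prefix are hypothetical
# helpers, not part of Tesseract.

# Append the period when a prefix such as /home/$USER/temp/eng lacks one.
with_trailing_period() {
    case $1 in
        *.) printf '%s\n' "$1" ;;
        *)  printf '%s.\n' "$1" ;;
    esac
}

# Unpack a .traineddata file ($1) to a normalized prefix ($2).
unpack_to_prefix() {
    prefix=$(with_trailing_period "$2")
    if command -v combine_tessdata >/dev/null 2>&1; then
        combine_tessdata -u "$1" "$prefix"
    else
        echo "would run: combine_tessdata -u $1 $prefix"
    fi
}
```

       For example, unpack_to_prefix tessdata/eng.traineddata
       "/home/$USER/temp/eng" passes /home/$USER/temp/eng. to the -u option.
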

COMPONENTS

       The components in a Tesseract lang.traineddata file as of Tesseract
       3.02 are briefly described below. For more information on many of
       these files, see
       https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

       lang.config
           (Optional) Language-specific overrides to default config variables.

       lang.unicharset
           (Required) The list of symbols that Tesseract recognizes, with
           properties. See unicharset(5).

       lang.unicharambigs
           (Optional) This file contains information on pairs of recognized
           symbols which are often confused. For example, rn and m.

       lang.inttemp
           (Required) Character shape templates for each unichar. Produced by
           mftraining(1).

       lang.pffmtable
           (Required) The number of features expected for each unichar.
           Produced by mftraining(1) from .tr files.

       lang.normproto
           (Required) Character normalization prototypes generated by
           cntraining(1) from .tr files.

       lang.punc-dawg
           (Optional) A dawg made from punctuation patterns found around
           words. The "word" part is replaced by a single space.

       lang.word-dawg
           (Optional) A dawg made from dictionary words from the language.

       lang.number-dawg
           (Optional) A dawg made from tokens which originally contained
           digits. Each digit is replaced by a space character.

       lang.freq-dawg
           (Optional) A dawg made from the most frequent words which would
           have gone into word-dawg.

       lang.fixed-length-dawgs
           (Optional) Several dawgs of different fixed lengths, useful for
           languages like Chinese.

       lang.cube-unicharset
           (Optional) A unicharset for cube, if cube was trained on a
           different set of symbols.

       lang.cube-word-dawg
           (Optional) A word dawg for cube’s alternate unicharset. Not needed
           if cube was trained with Tesseract’s unicharset.

       lang.shapetable
           (Optional) When present, a shapetable is an extra layer between the
           character classifier and the word recognizer that allows the
           character classifier to return a collection of unichar ids and
           fonts instead of a single unichar-id and font.

       lang.bigram-dawg
           (Optional) A dawg of word bigrams where the words are separated by
           a space and each digit is replaced by a ?.

       lang.unambig-dawg
           (Optional) TODO: Describe.

       lang.params-training-model
           (Optional) TODO: Describe.

HISTORY

       combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO

       tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1),
       unicharset(5), unicharambigs(5)

COPYING

       Copyright (C) 2009, Google Inc. Licensed under the Apache License,
       Version 2.0.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).



                                  06/12/2015               COMBINE_TESSDATA(1)