COMBINE_TESSDATA(1)                                        COMBINE_TESSDATA(1)

NAME
combine_tessdata - combine/extract/overwrite Tesseract data

SYNOPSIS
combine_tessdata [OPTION] FILE...

DESCRIPTION
combine_tessdata(1) is the main program to combine/extract/overwrite
tessdata components in [lang].traineddata files.

To combine all the individual tessdata components (unicharset, DAWGs,
classifier templates, ambiguities, language configs) located at, say,
/home/$USER/temp/eng.* run:

    combine_tessdata /home/$USER/temp/eng.

The result will be a combined tessdata file
/home/$USER/temp/eng.traineddata
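
Before combining, it can help to confirm that the components this page
marks as (Required) exist under the prefix. The following is a minimal
sketch, not part of Tesseract: the function name
missing_required_components is hypothetical, and the suffix list is an
assumption copied from the (Required) entries in this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print every required
# component file that is missing for the given prefix.
# The suffix list is assumed from the entries this page marks (Required).
missing_required_components() {
    prefix=$1   # full file prefix, including the trailing period
    for suffix in unicharset inttemp pffmtable normproto; do
        if [ ! -f "${prefix}${suffix}" ]; then
            printf '%s\n' "${prefix}${suffix}"
        fi
    done
}

# Possible usage: combine only when nothing is reported as missing.
# missing=$(missing_required_components "/home/$USER/temp/eng.")
# [ -z "$missing" ] && combine_tessdata "/home/$USER/temp/eng."
```

The helper only checks that the files exist; it does not validate their
contents.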

Specify option -e if you would like to extract individual components
from a combined traineddata file. For example, to extract the language
config file and the unicharset from tessdata/eng.traineddata run:

    combine_tessdata -e tessdata/eng.traineddata \
        /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharset

The desired config file and unicharset will be written to
/home/$USER/temp/eng.config and /home/$USER/temp/eng.unicharset

Specify option -o to overwrite individual components of the given
[lang].traineddata file. For example, to overwrite the language config
and unichar ambiguities files in tessdata/eng.traineddata use:

    combine_tessdata -o tessdata/eng.traineddata \
        /home/$USER/temp/eng.config /home/$USER/temp/eng.unicharambigs

As a result, tessdata/eng.traineddata will contain the new language
config and unichar ambigs, plus all the original DAWGs, classifier
templates, etc.
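
Because -o changes the given .traineddata file itself, one cautious
approach is to keep a copy first. The wrapper below is a sketch under
assumptions: overwrite_with_backup and the COMBINE_TESSDATA variable are
hypothetical names introduced here, and the .bak suffix is an arbitrary
choice.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Tesseract): copy the .traineddata
# file to <name>.bak, then overwrite the listed components with -o.
# COMBINE_TESSDATA may name the binary; it defaults to combine_tessdata.
overwrite_with_backup() {
    traineddata=$1
    shift   # the remaining arguments are the component files
    cp -p "$traineddata" "$traineddata.bak" || return 1
    "${COMBINE_TESSDATA:-combine_tessdata}" -o "$traineddata" "$@"
}

# Possible usage:
# overwrite_with_backup tessdata/eng.traineddata \
#     "/home/$USER/temp/eng.config" "/home/$USER/temp/eng.unicharambigs"
```

If the copy fails, the wrapper returns without running combine_tessdata.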

Note: the names of the files to extract to and to overwrite from
should have the appropriate file suffixes (extensions) indicating their
tessdata component type (.unicharset for the unicharset, .unicharambigs
for unichar ambigs, etc). See the k*FileSuffix variables in
ccutil/tessdatamanager.h.
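
A file name can be checked against the expected suffixes before it is
passed to -e or -o. This is only a sketch: has_component_suffix is a
hypothetical name, and its suffix list is an assumption copied from the
component names in this page; ccutil/tessdatamanager.h remains the
authoritative source.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): succeed when the file
# name ends in one of the component suffixes named in this page.
has_component_suffix() {
    # A name without any period cannot carry a component suffix.
    case "$1" in
        *.*) ;;
        *) return 1 ;;
    esac
    suffix=${1##*.}   # everything after the last period
    for known in config unicharset unicharambigs inttemp pffmtable \
                 normproto punc-dawg word-dawg number-dawg freq-dawg \
                 fixed-length-dawgs cube-unicharset cube-word-dawg \
                 shapetable bigram-dawg unambig-dawg \
                 params-training-model; do
        if [ "$suffix" = "$known" ]; then
            return 0
        fi
    done
    return 1
}

# Possible usage:
# has_component_suffix "/home/$USER/temp/eng.unicharset" || echo "bad suffix"
```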

Specify option -u to unpack all the components to the specified path:

    combine_tessdata -u tessdata/eng.traineddata /home/$USER/temp/eng.

This will create /home/$USER/temp/eng.* files with individual tessdata
components from tessdata/eng.traineddata.

OPTIONS
-e .traineddata FILE...: Extracts the specified components from the
.traineddata file.

-o .traineddata FILE...: Overwrites the specified components of the
.traineddata file with those provided on the command line.

-u .traineddata PATHPREFIX: Unpacks the .traineddata using the provided
prefix.

CAVEATS
Prefix refers to the full file prefix, including the period (.).
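
Since the period is part of the prefix, a script that builds the prefix
from a language code may want to normalize it. A minimal sketch follows;
ensure_trailing_period is a hypothetical name, not a Tesseract tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print the prefix with
# exactly one trailing period appended when it is missing.
ensure_trailing_period() {
    case "$1" in
        *.) printf '%s\n' "$1" ;;    # already ends with a period
        *)  printf '%s.\n' "$1" ;;   # append the missing period
    esac
}

# Possible usage:
# combine_tessdata -u tessdata/eng.traineddata \
#     "$(ensure_trailing_period "/home/$USER/temp/eng")"
```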

COMPONENTS
The components in a Tesseract lang.traineddata file as of Tesseract
3.02 are briefly described below. For more information on many of these
files, see
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

lang.config
    (Optional) Language-specific overrides to default config variables.

lang.unicharset
    (Required) The list of symbols that Tesseract recognizes, with
    properties. See unicharset(5).

lang.unicharambigs
    (Optional) This file contains information on pairs of recognized
    symbols which are often confused. For example, rn and m.

lang.inttemp
    (Required) Character shape templates for each unichar. Produced by
    mftraining(1).

lang.pffmtable
    (Required) The number of features expected for each unichar.
    Produced by mftraining(1) from .tr files.

lang.normproto
    (Required) Character normalization prototypes generated by
    cntraining(1) from .tr files.

lang.punc-dawg
    (Optional) A dawg made from punctuation patterns found around
    words. The "word" part is replaced by a single space.

lang.word-dawg
    (Optional) A dawg made from dictionary words from the language.

lang.number-dawg
    (Optional) A dawg made from tokens which originally contained
    digits. Each digit is replaced by a space character.

lang.freq-dawg
    (Optional) A dawg made from the most frequent words which would
    have gone into word-dawg.

lang.fixed-length-dawgs
    (Optional) Several dawgs of different fixed lengths, useful for
    languages like Chinese.

lang.cube-unicharset
    (Optional) A unicharset for Cube, if Cube was trained on a
    different set of symbols.

lang.cube-word-dawg
    (Optional) A word dawg for Cube’s alternate unicharset. Not needed
    if Cube was trained with Tesseract’s unicharset.

lang.shapetable
    (Optional) When present, a shapetable is an extra layer between the
    character classifier and the word recognizer that allows the
    character classifier to return a collection of unichar ids and
    fonts instead of a single unichar-id and font.

lang.bigram-dawg
    (Optional) A dawg of word bigrams where the words are separated by
    a space and each digit is replaced by a ?.

lang.unambig-dawg
    (Optional) TODO: Describe.

lang.params-training-model
    (Optional) TODO: Describe.

HISTORY
combine_tessdata(1) first appeared in version 3.00 of Tesseract.

SEE ALSO
tesseract(1), wordlist2dawg(1), cntraining(1), mftraining(1),
unicharset(5), unicharambigs(5)

COPYING
Copyright (C) 2009, Google Inc. Licensed under the Apache License,
Version 2.0.

AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research
groups at Hewlett Packard (1985-1995) and Google (2006-present).

06/12/2015                                                 COMBINE_TESSDATA(1)