COMBINE_LANG_MODEL(1)                                    COMBINE_LANG_MODEL(1)

NAME

       combine_lang_model - generate starter traineddata

SYNOPSIS

       combine_lang_model --input_unicharset filename --script_dir dirname
       --output_dir rootdir --lang lang [--lang_is_rtl]
       [--pass_through_recoder] [--version_str string]
       [--words file --puncs file --numbers file]

DESCRIPTION

       combine_lang_model(1) generates a starter traineddata file that can be
       used to train an LSTM-based neural network model. It takes as input a
       unicharset and an optional set of wordlists. It eliminates the need to
       run set_unicharset_properties(1), wordlist2dawg(1), a separate step to
       generate the recoder (unicode compressor), for which no standalone
       binary exists, and finally combine_tessdata(1).
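
       As a rough illustration of how the flags in SYNOPSIS fit together, the
       sketch below assembles an argument list in Python. The helper name
       build_args and every file and directory name are hypothetical; only
       the flag names are taken from this page. Nothing is executed here.

```python
def build_args(input_unicharset, script_dir, output_dir, lang,
               lang_is_rtl=False, pass_through_recoder=False,
               words=None, puncs=None, numbers=None):
    """Return an argv list for combine_lang_model (flags as in SYNOPSIS)."""
    args = [
        "combine_lang_model",
        "--input_unicharset", input_unicharset,
        "--script_dir", script_dir,
        "--output_dir", output_dir,
        "--lang", lang,
    ]
    # Boolean flags are passed bare, as written in SYNOPSIS.
    if lang_is_rtl:
        args.append("--lang_is_rtl")
    if pass_through_recoder:
        args.append("--pass_through_recoder")
    # Optional wordlists are added only when a path was given.
    for flag, path in (("--words", words),
                       ("--puncs", puncs),
                       ("--numbers", numbers)):
        if path:
            args += [flag, path]
    return args


# Hypothetical paths, for illustration only.
argv = build_args("eng.unicharset", "langdata", "out", "eng",
                  words="eng.wordlist", puncs="eng.punc",
                  numbers="eng.numbers")
print(" ".join(argv))
```

       The resulting list could be handed to subprocess.run() on a system
       where the Tesseract training tools are installed.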

OPTIONS

       --lang lang
           The language to use. Tesseract uses 3-character ISO 639-2 language
           codes. (See the language list in tesseract(1).)

       --script_dir PATH
           Directory containing the input script unicharsets. It should point
           to a checkout of the langdata repository (on GitHub). (type:string
           default:)

       --input_unicharset FILE
           Unicharset to complete and use in encoding. It can be a
           hand-created file with incomplete fields. Its basic and script
           properties will be set before it is used. (type:string default:)

       --lang_is_rtl BOOL
           True if the language being processed is written right-to-left
           (e.g. Arabic or Hebrew). (type:bool default:false)

       --pass_through_recoder BOOL
           If true, the recoder is a simple pass-through of the unicharset.
           Otherwise, the recoder may compress the unicharset by encoding
           Hangul as Jamos, decomposing multi-codepoint symbols into
           sequences of code points, and encoding Han using the data in
           radical_table_data, which must be the content of the file
           langdata/radical-stroke.txt. (type:bool default:false)

       --version_str STRING
           An arbitrary version label to add to the traineddata file.
           (type:string default:)

       --words FILE
           (Optional) File listing words to use for the system dictionary.
           (type:string default:)

       --numbers FILE
           (Optional) File listing number patterns. (type:string default:)

       --puncs FILE
           (Optional) File listing punctuation patterns. The
           words/puncs/numbers lists may all be empty. If any of them is
           non-empty, then puncs must be non-empty. (type:string default:)

       --output_dir PATH
           Root directory for output files. Output files will be written to
           <output_dir>/<lang>/<lang>.* (type:string default:)
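
       The phrase "encoding Hangul in Jamos" under --pass_through_recoder can
       be illustrated with the standard Unicode arithmetic that splits a
       precomposed Hangul syllable into its conjoining Jamos. This is only a
       sketch of the general idea: the function name hangul_to_jamos is
       hypothetical, and the codes Tesseract's recoder actually assigns are
       internal and not shown here.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def hangul_to_jamos(syllable):
    """Decompose one precomposed Hangul syllable into conjoining Jamos."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = index % T_COUNT  # 0 means the syllable has no trailing consonant
    jamos = chr(lead) + chr(vowel)
    if trail:
        jamos += chr(T_BASE + trail)
    return jamos


jamos = hangul_to_jamos("한")
print([f"U+{ord(c):04X}" for c in jamos])  # ['U+1112', 'U+1161', 'U+11AB']

# Cross-check against Python's canonical decomposition (NFD).
assert jamos == unicodedata.normalize("NFD", "한")
```

       A few dozen Jamos are enough to spell every one of the 11172
       precomposed syllables, which is why such an encoding can act as a
       compression of the unicharset.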
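
       The wordlist rule stated under --puncs and the output location stated
       under --output_dir can be sketched as two small checks. Both helper
       names are hypothetical, and the sketch assumes that "empty" refers to
       the contents of each list; it does not call combine_lang_model.

```python
from pathlib import Path


def wordlists_ok(words, puncs, numbers):
    """True if the lists satisfy the rule stated under --puncs.

    All three may be empty; if any is non-empty, puncs must be non-empty.
    """
    return bool(puncs) or not (words or numbers)


def output_files(output_dir, lang):
    """List existing files matching <output_dir>/<lang>/<lang>.*"""
    return sorted(Path(output_dir, lang).glob(f"{lang}.*"))


print(wordlists_ok([], [], []))             # True: all lists empty
print(wordlists_ok(["cat"], [], []))        # False: words given, no puncs
print(wordlists_ok(["cat"], ["."], ["5"]))  # True: puncs is non-empty
```

       After a run, output_files("out", "eng") would list whatever the tool
       wrote under out/eng/ for the language eng ("out" being a hypothetical
       output root).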

HISTORY

       combine_lang_model(1) was first made available for Tesseract
       4.00.00alpha.

RESOURCES

       Main web site: https://github.com/tesseract-ocr

       Information on training tesseract LSTM:
       https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

SEE ALSO

       tesseract(1)

COPYING

       Copyright (C) 2012 Google, Inc. Licensed under the Apache License,
       Version 2.0.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).



                                  07/22/2023             COMBINE_LANG_MODEL(1)