COMBINE_LANG_MODEL(1)                                    COMBINE_LANG_MODEL(1)

NAME
       combine_lang_model - generate starter traineddata

SYNOPSIS
       combine_lang_model --input_unicharset filename --script_dir dirname
       --output_dir rootdir --lang lang [--lang_is_rtl] [--pass_through_recoder]
       [--words file --puncs file --numbers file]

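       As an illustration of the synopsis, the following minimal Python
       sketch assembles an argument list from the four required options and
       the optional ones. The helper name build_command and the file names
       are hypothetical; the sketch only builds and prints the command and
       does not run it.

```python
import shlex


def build_command(input_unicharset, script_dir, output_dir, lang,
                  lang_is_rtl=False, pass_through_recoder=False,
                  words=None, puncs=None, numbers=None):
    """Return the argv list for a combine_lang_model run (nothing is executed)."""
    argv = [
        "combine_lang_model",
        "--input_unicharset", input_unicharset,
        "--script_dir", script_dir,
        "--output_dir", output_dir,
        "--lang", lang,
    ]
    # Boolean options are passed as bare flags, as in the synopsis.
    if lang_is_rtl:
        argv.append("--lang_is_rtl")
    if pass_through_recoder:
        argv.append("--pass_through_recoder")
    # Wordlist options are added only when a path is supplied.
    for flag, path in (("--words", words), ("--puncs", puncs),
                       ("--numbers", numbers)):
        if path:
            argv += [flag, path]
    return argv


# Hypothetical paths, for illustration only.
cmd = build_command("ara.unicharset", "langdata", "out", "ara",
                    lang_is_rtl=True)
print(shlex.join(cmd))
```

       The resulting list can be handed to subprocess.run(cmd, check=True)
       once the paths point at real files.
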
DESCRIPTION
       combine_lang_model(1) generates a starter traineddata file that can be
       used to train an LSTM-based neural network model. It takes as input a
       unicharset and an optional set of wordlists. It eliminates the need to
       run set_unicharset_properties(1), wordlist2dawg(1), a step to generate
       the recoder (unicode compressor), for which no separate binary
       exists, and finally combine_tessdata(1).

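       To give a feel for what a recoder (unicode compressor) can do, the
       sketch below splits Hangul syllables into Jamos, one of the
       compressions listed under --pass_through_recoder. It uses NFD
       normalization from Python's standard unicodedata module as a
       stand-in; this is an assumption made for illustration and is not
       claimed to match Tesseract's internal recoder. The helper name
       to_jamos is hypothetical.

```python
import unicodedata


def to_jamos(syllable):
    """Split one precomposed Hangul syllable into its Jamo code points (NFD)."""
    return list(unicodedata.normalize("NFD", syllable))


# U+D55C decomposes into a leading consonant, a vowel and a trailing consonant.
for jamo in to_jamos("\ud55c"):
    print(f"U+{ord(jamo):04X}")
```

       Many thousands of precomposed syllables are covered by a few dozen
       Jamos in this way, which is why such an encoding can shrink the set
       of output codes.
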
OPTIONS
       --lang lang
           The language to use. Tesseract uses 3-character ISO 639-2 language
           codes. (See LANGUAGES in tesseract(1).)

       --script_dir PATH
           Directory name for input script unicharsets. It should point to
           the location of the langdata directory (a checkout of the langdata
           GitHub repository). (type:string default:)

       --input_unicharset FILE
           Unicharset to complete and use in encoding. It can be a
           hand-created file with incomplete fields. Its basic and script
           properties will be set before it is used. (type:string default:)

       --lang_is_rtl BOOL
           True if the language being processed is written right-to-left
           (e.g. Arabic or Hebrew). (type:bool default:false)

       --pass_through_recoder BOOL
           If true, the recoder is a simple pass-through of the unicharset.
           Otherwise, the recoder may compress the unicharset by encoding
           Hangul in Jamos, decomposing multi-unicode symbols into sequences
           of unicodes, and encoding Han using the data in
           radical_table_data, which must be the content of the file
           langdata/radical-stroke.txt. (type:bool default:false)

       --version_str STRING
           An arbitrary version label to add to the traineddata file.
           (type:string default:)

       --words FILE
           (Optional) File listing words to use for the system dictionary.
           (type:string default:)

       --numbers FILE
           (Optional) File listing number patterns. (type:string default:)

       --puncs FILE
           (Optional) File listing punctuation patterns. The words, puncs
           and numbers lists may all be empty. If any of them is non-empty,
           then puncs must be non-empty. (type:string default:)

       --output_dir PATH
           Root directory for output files. Output files will be written to
           <output_dir>/<lang>/<lang>.* (type:string default:)

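       Two of the rules above lend themselves to a quick check before a
       run: the constraint tying --words, --puncs and --numbers together,
       and the output location derived from --output_dir and --lang. The
       following minimal Python sketch encodes both. The helper names
       check_wordlists and output_prefix are hypothetical, and the lists
       are assumed to be already loaded as Python lists of strings.

```python
from pathlib import Path


def check_wordlists(words, puncs, numbers):
    """Raise ValueError if any list is non-empty while puncs is empty."""
    if (words or puncs or numbers) and not puncs:
        raise ValueError(
            "puncs must be non-empty when any wordlist is non-empty")


def output_prefix(output_dir, lang):
    """Return <output_dir>/<lang>/<lang>, the stem shared by the output files."""
    return Path(output_dir) / lang / lang


check_wordlists([], [], [])          # all three empty: allowed
check_wordlists(["the"], ["."], [])  # puncs is non-empty: allowed
print(output_prefix("out", "eng").as_posix())
```

       Calling check_wordlists(["the"], [], []) raises ValueError, since
       a word list is given without any punctuation patterns.
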
HISTORY
       combine_lang_model(1) was first made available for tesseract
       4.00.00alpha.

RESOURCES
       Main web site: https://github.com/tesseract-ocr
       Information on training tesseract LSTM:
       https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

SEE ALSO
       tesseract(1)

COPYING
       Copyright (C) 2012 Google, Inc. Licensed under the Apache License,
       Version 2.0

AUTHOR
       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).

07/22/2023                                               COMBINE_LANG_MODEL(1)