WORDLIST2DAWG(1) WORDLIST2DAWG(1)

NAME
       wordlist2dawg - convert a wordlist to a DAWG for Tesseract

SYNOPSIS
       wordlist2dawg WORDLIST DAWG lang.unicharset

       wordlist2dawg -t WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

       wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

DESCRIPTION
       wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
       (DAWG) for use with Tesseract. A DAWG is a compressed, space- and
       time-efficient representation of a word list.
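To illustrate why a DAWG is more compact than a plain word list or trie, here is a minimal sketch in Python. It is not Tesseract's implementation or file format. It only assumes the general idea that a DAWG shares both common prefixes (like a trie) and common suffixes (by merging equivalent subtrees). All function names are hypothetical.

```python
def build_trie(words):
    """Build a plain trie: each node is {'end': bool, 'next': {char: node}}."""
    root = {"end": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"end": False, "next": {}})
        node["end"] = True
    return root


def count_trie_nodes(node):
    """Count every node in the trie, including the root."""
    return 1 + sum(count_trie_nodes(child) for child in node["next"].values())


def count_dawg_nodes(root):
    """Count the nodes left after merging subtrees that accept the same suffixes."""
    registry = {}  # structural signature -> canonical node id

    def canonical(node):
        # Children are canonicalised first, so two nodes share a signature
        # exactly when their whole subtrees are equivalent.
        signature = (
            node["end"],
            tuple(sorted((ch, canonical(child)) for ch, child in node["next"].items())),
        )
        return registry.setdefault(signature, len(registry))

    canonical(root)
    return len(registry)


words = ["tap", "taps", "top", "tops"]
trie = build_trie(words)
print(count_trie_nodes(trie), count_dawg_nodes(trie))  # prints: 8 5
```

In this toy case the two branches "ta" and "to" lead to identical suffix sets ("p", "ps"), so the merged graph needs 5 nodes where the trie needs 8.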

OPTIONS
       -t
           Verify that a given DAWG file is equivalent to a given wordlist.

       -r 1
           Reverse a word if it contains a right-to-left (RTL) character.

       -r 2
           Reverse all words.

       -l <short> <long>
           Produce a file containing several DAWGs, one for each word
           length <short>, <short>+1, ..., <long>.
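The effect of the -r and -l options on the input words can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: RTL detection here uses the Unicode bidirectional classes "R" and "AL" from Python's unicodedata module, words are reversed and measured by code point, policy 0 is a hypothetical label for running without -r, and words outside the <short>..<long> range are simply skipped. The real tool works with the entries of the unicharset and may differ on all of these points.

```python
import unicodedata
from collections import defaultdict

# Assumption: these Unicode bidi classes count as "RTL characters".
RTL_CLASSES = {"R", "AL"}


def has_rtl(word):
    """Return True if any code point in the word has an RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in word)


def apply_reverse_policy(word, policy):
    """Mimic -r: 1 reverses words containing RTL characters, 2 reverses all.

    Policy 0 (hypothetical name for "no -r option") leaves the word alone.
    """
    if policy == 2 or (policy == 1 and has_rtl(word)):
        return word[::-1]
    return word


def bucket_by_length(words, short, long):
    """Mimic -l: group words by length, one group per length in short..long."""
    buckets = defaultdict(list)
    for word in words:
        if short <= len(word) <= long:
            buckets[len(word)].append(word)
    return dict(buckets)


print(apply_reverse_policy("cat", 1))  # no RTL character, so unchanged
print(apply_reverse_policy("cat", 2))  # always reversed
print(bucket_by_length(["a", "to", "cat", "dog", "house"], 2, 3))
```

Each group returned by bucket_by_length corresponds to one of the per-length DAWGs that -l is described as producing.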

ARGUMENTS
       WORDLIST
           A plain text file in UTF-8, one word per line.

       DAWG
           The output DAWG to write.

       lang.unicharset
           The unicharset of the language. This is the unicharset
           generated by mftraining(1).
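A small Python sketch of preparing the WORDLIST argument and assembling a command line in the order given in the synopsis. The helper names and the file names in the example are hypothetical, and dropping blank lines and duplicates is a convenience of this sketch rather than a documented requirement of the tool.

```python
def write_wordlist(words, path):
    """Write unique, non-empty words to `path` as UTF-8, one word per line."""
    seen = set()
    kept = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            kept.append(word)
    # newline="\n" keeps one-word-per-line output identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(kept) + "\n")
    return kept


def build_command(wordlist, dawg, unicharset, options=()):
    """Assemble argv as: wordlist2dawg [options] WORDLIST DAWG lang.unicharset."""
    return ["wordlist2dawg", *options, str(wordlist), str(dawg), str(unicharset)]


# Hypothetical file names, for illustration only.
print(build_command("words.txt", "out.dawg", "lang.unicharset", ["-r", "1"]))
```

With the tool installed, the resulting list could be passed to subprocess.run(cmd, check=True) to perform the conversion.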

SEE ALSO
       tesseract(1), combine_tessdata(1), dawg2wordlist(1)

       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

COPYING
       Copyright (C) 2006 Google, Inc. Licensed under the Apache License,
       Version 2.0.

AUTHOR
       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).

07/22/2023 WORDLIST2DAWG(1)