WORDLIST2DAWG(1)                                              WORDLIST2DAWG(1)

NAME

       wordlist2dawg - convert a wordlist to a DAWG for Tesseract

SYNOPSIS

       wordlist2dawg WORDLIST DAWG lang.unicharset

       wordlist2dawg -t WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 1 WORDLIST DAWG lang.unicharset

       wordlist2dawg -r 2 WORDLIST DAWG lang.unicharset

       wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

DESCRIPTION

       wordlist2dawg(1) converts a wordlist to a Directed Acyclic Word Graph
       (DAWG) for use with Tesseract. A DAWG is a compressed, space- and
       time-efficient representation of a word list.

OPTIONS

       -t Verify that a given DAWG file is equivalent to a given wordlist.

       -r 1 Reverse a word if it contains an RTL character.

       -r 2 Reverse all words.

       -l <short> <long> Produce a file containing several DAWGs, one for
       each word length <short>, <short>+1, ..., <long>.
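
       As an illustration of the options above, the following sketch builds a
       DAWG and then checks it against the same wordlist with -t. The file
       names eng.wordlist, eng.word-dawg and eng.unicharset are placeholders
       chosen for this example, not names the tool requires. The script skips
       both steps when wordlist2dawg or the input files are not available.

```shell
#!/bin/sh
# Placeholder file names (assumptions made for this sketch).
WORDLIST=eng.wordlist
DAWG=eng.word-dawg
UNICHARSET=eng.unicharset

if command -v wordlist2dawg >/dev/null 2>&1 \
    && [ -f "$WORDLIST" ] && [ -f "$UNICHARSET" ]; then
    # Build the DAWG from the wordlist.
    wordlist2dawg "$WORDLIST" "$DAWG" "$UNICHARSET"
    # Check that the DAWG just written is equivalent to the wordlist.
    wordlist2dawg -t "$WORDLIST" "$DAWG" "$UNICHARSET"
    STATUS=ran
else
    echo "wordlist2dawg or its input files not found; skipping" >&2
    STATUS=skipped
fi
```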

ARGUMENTS

       WORDLIST A plain text file in UTF-8, one word per line.

       DAWG The output DAWG to write.

       lang.unicharset The unicharset of the language. This is the unicharset
       generated by mftraining(1).
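
       A WORDLIST in the form described above (UTF-8, one word per line) could
       be prepared roughly as follows. The file names raw_words.txt and
       words.txt and the sample words are invented for this sketch. Removing
       empty lines and duplicates is a tidy-up step chosen here, not a
       documented requirement of wordlist2dawg.

```shell
#!/bin/sh
# Sample input with an empty line and a duplicate (made-up data).
printf 'cat\ndog\n\ncat\nbird\n' > raw_words.txt

# Drop empty lines and duplicates, leaving one word per line.
# sort -u is used only for de-duplication in this sketch.
grep -v '^$' raw_words.txt | LC_ALL=C sort -u > words.txt

# iconv exits with a non-zero status if the file is not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 words.txt >/dev/null 2>&1; then
    echo "words.txt is valid UTF-8"
else
    echo "words.txt is not valid UTF-8 (or iconv is unavailable)" >&2
fi
```

       The resulting words.txt can then be passed as the WORDLIST argument.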

SEE ALSO

       tesseract(1), combine_tessdata(1), dawg2wordlist(1)

       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

COPYING

       Copyright (C) 2006 Google, Inc. Licensed under the Apache License,
       Version 2.0.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).

                                  07/22/2023                  WORDLIST2DAWG(1)