unicharambigs(5)

UNICHARAMBIGS(5)                                              UNICHARAMBIGS(5)

NAME

       unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION

       The unicharambigs file (a component of traineddata, see
       combine_tessdata(1)) is used by Tesseract to represent possible
       ambiguities between characters, or groups of characters.

       The file contains a number of lines, laid out as follows:

           [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

       Field one     the number of characters
                     contained in field two

       Field two     the character sequence to
                     be replaced

       Field three   the number of characters
                     contained in field four

       Field four    the character sequence
                     used to replace field two

       Field five    contains either 1 or 0. 1
                     denotes a mandatory
                     replacement, 0 denotes an
                     optional replacement.

       Characters appearing in fields two and four should appear in
       unicharset. The numbers in fields one and three refer to the number of
       unichars (not bytes).
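
       The layout above can be sketched as a small parser. This is an
       illustration only, not Tesseract's own reader: the names AmbigRule and
       parse_ambig_line are hypothetical, and the sketch assumes that fields
       are separated by a single TAB and that unichars within fields two and
       four are separated by single spaces, as in the EXAMPLE section.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AmbigRule:
    """One line of a unicharambigs file (hypothetical helper type)."""
    source: List[str]       # field two: unichars to be replaced
    replacement: List[str]  # field four: unichars used as the replacement
    mandatory: bool         # field five: True for 1, False for 0


def parse_ambig_line(line: str) -> AmbigRule:
    """Parse one TAB-separated line and check the declared counts."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 TAB-separated fields, got %d" % len(fields))
    n_source, source_text, n_replacement, replacement_text, flag = fields

    # Assumption: unichars inside a field are separated by single spaces.
    source = source_text.split(" ")
    replacement = replacement_text.split(" ")

    # Fields one and three count unichars, not bytes.
    if int(n_source) != len(source):
        raise ValueError("field one does not match field two")
    if int(n_replacement) != len(replacement):
        raise ValueError("field three does not match field four")
    if flag not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")

    return AmbigRule(source, replacement, flag == "1")


print(parse_ambig_line("3\ti i i\t1\tm\t0"))
# AmbigRule(source=['i', 'i', 'i'], replacement=['m'], mandatory=False)
```

       Counting list elements rather than bytes keeps the check valid for
       unichars that occupy more than one byte in UTF-8.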

EXAMPLE

           2       ' '     1       "     1
           1       m       2       r n   0
           3       i i i   1       m     0

       In this example, all instances of the 2 character sequence '' will
       always be replaced by the 1 character sequence "; a 1 character
       sequence m may be replaced by the 2 character sequence rn, and the 3
       character sequence iii may be replaced by the 1 character sequence m.
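
       The difference between mandatory and optional replacements can be
       sketched as follows. This is one plausible reading of the semantics
       described above, not a description of how Tesseract applies the rules
       internally. RULES, apply_mandatory and optional_alternatives are
       hypothetical names, and the left-to-right scan is an assumption.

```python
# The three example rules as (source unichars, replacement unichars, mandatory).
RULES = [
    (["'", "'"], ['"'], True),
    (["m"], ["r", "n"], False),
    (["i", "i", "i"], ["m"], False),
]


def apply_mandatory(unichars, rules):
    """Replace every match of a mandatory rule, scanning left to right."""
    out = []
    i = 0
    while i < len(unichars):
        for source, replacement, mandatory in rules:
            if mandatory and unichars[i:i + len(source)] == source:
                out.extend(replacement)
                i += len(source)
                break
        else:
            # No mandatory rule matched at this position: keep the unichar.
            out.append(unichars[i])
            i += 1
    return out


def optional_alternatives(unichars, rules):
    """Yield each sequence obtained by applying one optional rule once."""
    for source, replacement, mandatory in rules:
        if mandatory:
            continue
        for i in range(len(unichars) - len(source) + 1):
            if unichars[i:i + len(source)] == source:
                yield unichars[:i] + replacement + unichars[i + len(source):]


print("".join(apply_mandatory(list("''modern''"), RULES)))
# "modern"
print(["".join(alt) for alt in optional_alternatives(list("modern"), RULES)])
# ['rnodern']
```

       Mandatory rules rewrite the input, whereas optional rules only produce
       extra candidates that sit alongside the original sequence.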

HISTORY

       The unicharambigs file first appeared in Tesseract 3.00; prior to that,
       a similar format, called DangAmbigs (dangerous ambiguities), was used:
       the format was almost identical, except that only mandatory
       replacements could be specified and field 5 was absent.

BUGS

       This is a documentation "bug": it’s not currently clear what should be
       done in the case of ligatures (such as fi) which may also appear as
       regular letters in the unicharset.

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).

                                  06/12/2015                  UNICHARAMBIGS(5)
