UNICHARAMBIGS(5)                                              UNICHARAMBIGS(5)

NAME
unicharambigs - Tesseract unicharset ambiguities

DESCRIPTION
The unicharambigs file (a component of traineddata, see
combine_tessdata(1)) is used by Tesseract to represent possible
ambiguities between characters or groups of characters.

The file contains a number of lines, laid out as follows:

    [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]

Field one     the number of characters contained in field two

Field two     the character sequence to be replaced

Field three   the number of characters contained in field four

Field four    the character sequence used to replace field two

Field five    contains either 1 or 0. 1 denotes a mandatory
              replacement, 0 denotes an optional replacement.

Characters appearing in fields two and four should appear in
unicharset. The numbers in fields one and three refer to the number of
unichars (not bytes).

EXAMPLE
    v1
    2       ' '     1       "       1
    1       m       2       r n     0
    3       i i i   1       m       0

The first line is a version identifier. In this example, all instances
of the 2-character sequence '' will always be replaced by the
1-character sequence "; the 1-character sequence m may be replaced by
the 2-character sequence rn; and the 3-character sequence iii may be
replaced by the 1-character sequence m.

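As an illustration of the five-field layout, the following Python sketch reads one v1 line and checks that the counts in fields one and three match the sequences in fields two and four. It is not part of Tesseract: the names Ambiguity and parse_v1_line are hypothetical, and the sketch assumes that the unichars inside a field are separated by single spaces, as in the example above.

```python
from typing import List, NamedTuple


class Ambiguity(NamedTuple):
    source: List[str]   # field two: unichars to be replaced
    target: List[str]   # field four: replacement unichars
    mandatory: bool     # field five: True for 1, False for 0


def parse_v1_line(line: str) -> Ambiguity:
    """Parse one TAB-separated v1 line (hypothetical helper, not Tesseract code)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-separated fields, got {len(fields)}")
    src_count, src, dst_count, dst, kind = fields

    # Assumption: unichars within a field are separated by single spaces.
    source = src.split(" ")
    target = dst.split(" ")

    # Fields one and three count unichars, not bytes.
    if int(src_count) != len(source):
        raise ValueError("field one does not match the length of field two")
    if int(dst_count) != len(target):
        raise ValueError("field three does not match the length of field four")
    if kind not in ("0", "1"):
        raise ValueError("field five must be 0 or 1")
    return Ambiguity(source, target, kind == "1")
```

For instance, parse_v1_line("1\tm\t2\tr n\t0") yields the source ["m"], the target ["r", "n"], and mandatory set to False.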
Tesseract 3.03 and later support a new, simpler format for the
unicharambigs file:

    v2
    '' " 1
    m rn 0
    iii m 0

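The v1 and v2 examples describe the same three ambiguities, so a v1 line can be turned into its v2 form by dropping the counts and joining the unichars of each sequence. The Python sketch below shows this under the assumption that unichars in a v1 field are separated by single spaces; v1_line_to_v2 is a hypothetical helper, not a Tesseract tool.

```python
def v1_line_to_v2(line: str) -> str:
    """Convert one TAB-separated v1 line to the space-separated v2 layout."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 TAB-separated fields, got {len(fields)}")
    _src_count, src, _dst_count, dst, kind = fields

    # v2 carries no counts: each sequence becomes one plain UTF-8 string.
    error = src.replace(" ", "")
    correction = dst.replace(" ", "")
    return f"{error} {correction} {kind}"
```

Applied to the line "3<TAB>i i i<TAB>1<TAB>m<TAB>0", this returns "iii m 0", matching the last line of the v2 example.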
In this format, the "error" and "correction" are simple UTF-8 strings
separated by a space and followed, after another space, by the same
type specifier as in v1 (0 for optional and 1 for mandatory
substitution). The downside of this simpler format is that Tesseract
has to encode the UTF-8 strings into the components of the unicharset.
In complex scripts, this encoding may be ambiguous. In that case, the
encoding is chosen so as to use the fewest UTF-8 characters for each
component, i.e. the shortest unicharset components will make up the
encoding.

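One possible reading of the "shortest components" rule can be sketched in Python as a search that, at each position, tries the shorter matching unicharset components first. This is only an interpretation of the paragraph above, not Tesseract's actual encoder; the function name and the use of a plain set of strings as the unicharset are assumptions made for illustration.

```python
from typing import List, Optional, Set


def encode_shortest_first(text: str, unicharset: Set[str]) -> Optional[List[str]]:
    """Split text into unicharset components, preferring shorter components.

    Returns None when text cannot be built from the given components.
    Illustrative sketch only; not how Tesseract is known to implement it.
    """
    if not text:
        return []
    # Components that match at the start of text, shortest first.
    candidates = sorted(
        (u for u in unicharset if u and text.startswith(u)), key=len
    )
    for unichar in candidates:
        rest = encode_shortest_first(text[len(unichar):], unicharset)
        if rest is not None:
            return [unichar] + rest
    return None
```

With a unicharset containing "f", "i" and the two-letter component "fi", the string "fi" is split into ["f", "i"]; if only "fi" is present, the longer component is used instead.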
HISTORY
The unicharambigs file first appeared in Tesseract 3.00. Prior to that,
a similar format, called DangAmbigs (dangerous ambiguities), was used.
That format was almost identical, except that only mandatory
replacements could be specified and field five was absent.

BUGS
This is a documentation "bug": it is not currently clear what should be
done in the case of ligatures (such as fi), which may also appear as
regular letters in the unicharset.

SEE ALSO
tesseract(1), unicharset(5)
https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file

AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research
groups at Hewlett Packard (1985-1995) and Google (2006-present).

03/20/2023                                                    UNICHARAMBIGS(5)