unicharset(5)

UNICHARSET(5)                                                    UNICHARSET(5)

NAME

       unicharset - character properties file used by tesseract(1)

DESCRIPTION

       Tesseract’s unicharset file contains information on each symbol
       (unichar) the Tesseract OCR engine is trained to recognize.

       A unicharset file (e.g. eng.unicharset) is distributed as part of a
       Tesseract language pack (e.g. eng.traineddata). For information on
       extracting the unicharset file, see combine_tessdata(1).

       The first line of a unicharset file contains the number of unichars in
       the file. After this line, each subsequent line provides information
       for a single unichar. The first such line contains a placeholder
       reserved for the space character. Each unichar is referred to within
       Tesseract by its Unichar ID, which is its line number among the
       unichar lines, minus 1 (the count line is not counted). Therefore,
       space gets Unichar ID 0.

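       As a minimal sketch of the layout just described, the following
       Python splits the text of a unicharset file into its count line and
       its unichar lines, so that the list index is the Unichar ID. The
       function name load_unichar_lines and the three-entry sample are
       hypothetical and chosen for illustration only; this is not
       Tesseract's own reader.

```python
def load_unichar_lines(text):
    """Return the unichar lines of a unicharset file, indexed by Unichar ID.

    Hypothetical helper: it only checks the declared count against the
    number of lines present and does not interpret the fields.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty unicharset text")
    declared = int(lines[0].split()[0])   # first line: number of unichars
    unichar_lines = lines[1:1 + declared]  # list index == Unichar ID
    if len(unichar_lines) != declared:
        raise ValueError(
            f"header declares {declared} unichars, found {len(unichar_lines)}"
        )
    return unichar_lines


# Made-up sample: a count line, the space placeholder, then two unichars.
sample = "3\nNULL 0 NULL 0\nb 3 Latin 59\nW 5 Latin 40\n"
unichars = load_unichar_lines(sample)
```

       In this sample, unichars[0] is the placeholder line reserved for
       space, matching the rule that space gets Unichar ID 0.
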
       Each unichar line in the unicharset file (v2+) may have four
       space-separated fields:

           'character' 'properties' 'script' 'id'

       Starting with Tesseract v3.02, more information may be given for each
       unichar:

           'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'

       Entries:

       character
           The UTF-8 encoded string to be produced for this unichar.

       properties
           An integer mask of character properties, one per bit, written in
           hexadecimal. From least to most significant bit, these are:
           isalpha, islower, isupper, isdigit, ispunctuation.

       glyph_metrics
           Ten comma-separated integers giving the limits within which this
           glyph is to be found, in a baseline-normalized coordinate system
           where 128 is normalized to x-height.

           •   min_bottom, max_bottom: the range in which the bottom of the
               character can be found.

           •   min_top, max_top: the range in which the top of the character
               can be found.

           •   min_width, max_width: the range of the horizontal width of the
               character.

           •   min_bearing, max_bearing: the range of how far from the usual
               start position the leftmost part of the character begins.

           •   min_advance, max_advance: the range of how far from the
               printer’s cell left we advance to begin the next character.

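       The glyph_metrics field can be unpacked into named values with a few
       lines of Python. This is a sketch only: the names GlyphMetrics and
       parse_glyph_metrics are hypothetical, and it assumes that the ten
       integers appear in the same order as the pairs listed above.

```python
from typing import NamedTuple


class GlyphMetrics(NamedTuple):
    # Assumed order: the same order in which the pairs are listed above.
    min_bottom: int
    max_bottom: int
    min_top: int
    max_top: int
    min_width: int
    max_width: int
    min_bearing: int
    max_bearing: int
    min_advance: int
    max_advance: int


def parse_glyph_metrics(field: str) -> GlyphMetrics:
    """Split a glyph_metrics field into its ten integers."""
    values = [int(v) for v in field.split(",")]
    if len(values) != 10:
        raise ValueError(f"expected 10 integers, got {len(values)}")
    return GlyphMetrics(*values)


# The glyph_metrics field of the "N" line in EXAMPLE (V3.02).
metrics = parse_glyph_metrics("59,68,216,255,87,236,0,27,104,227")
```

       Under the ordering assumption, metrics.min_bottom is 59 and
       metrics.max_advance is 227 for that line.
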
       script
           Name of the script (Latin, Common, Greek, Cyrillic, Han, null).

       other_case
           The Unichar ID of the other case version of this character (upper
           or lower).

       direction
           The Unicode BiDi direction of this character, as defined by ICU’s
           enum UCharDirection. (0 = Left to Right, 1 = Right to Left, 2 =
           European Number...)

       mirror
           The Unichar ID of the BiDirectional mirror of this character. For
           example, the mirror of open paren is close paren, but Latin Capital
           C has no mirror, so it remains a Latin Capital C.

       normed_form
           The UTF-8 representation of a "normalized form" of this unichar for
           the purpose of blaming a module for errors given ground truth text.
           For instance, a left or right single quote may normalize to an
           ASCII quote.

EXAMPLE (V2)

           ; 10 Common 46
           b 3 Latin 59
           W 5 Latin 40
           7 8 Common 66
           = 0 Common 93

       ";" is a punctuation character. Its properties are thus represented by
       the binary number 10000 (10 in hexadecimal).

       "b" is an alphabetic character and a lower case character. Its
       properties are thus represented by the binary number 00011 (3 in
       hexadecimal).

       "W" is an alphabetic character and an upper case character. Its
       properties are thus represented by the binary number 00101 (5 in
       hexadecimal).

       "7" is just a digit. Its properties are thus represented by the binary
       number 01000 (8 in hexadecimal).

       "=" is neither punctuation, a digit, nor an alphabetic character. Its
       properties are thus represented by the binary number 00000 (0 in
       hexadecimal).

       Japanese or Chinese alphabetic character properties are represented by
       the binary number 00001 (1 in hexadecimal): they are alphabetic, but
       neither upper nor lower case.

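       The bit arithmetic worked through above can be checked with a short
       Python sketch. The names PROPERTY_BITS and decode_properties are
       hypothetical; the bit order is the one given under "properties" in
       DESCRIPTION, and the field is read as hexadecimal as in the
       explanations above.

```python
# Property names from least to most significant bit.
PROPERTY_BITS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(field):
    """Return the names of the properties set in a hexadecimal mask."""
    mask = int(field, 16)
    return [name for bit, name in enumerate(PROPERTY_BITS) if mask & (1 << bit)]


# The property fields of the five example lines above.
for char, field in [(";", "10"), ("b", "3"), ("W", "5"), ("7", "8"), ("=", "0")]:
    print(char, decode_properties(field))
```

       For "b" this yields isalpha and islower, and for "=" an empty list,
       in agreement with the explanations above.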

EXAMPLE (V3.02)

           110
           NULL 0 NULL 0
           N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
           Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
           1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
           9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
           a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
           . . .

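       A full eight-field line such as those above can be split into named
       fields as sketched below. The names UnicharEntry and parse_v302_line
       are hypothetical, and the sketch assumes that other_case, direction
       and mirror are written in decimal while properties is hexadecimal.
       It does not handle the shorter placeholder line (NULL 0 NULL 0) or
       older formats.

```python
from typing import NamedTuple, Tuple


class UnicharEntry(NamedTuple):
    character: str
    properties: int                  # bit mask, parsed from hexadecimal
    glyph_metrics: Tuple[int, ...]   # ten integers
    script: str
    other_case: int                  # Unichar ID (assumed decimal)
    direction: int                   # ICU UCharDirection value (assumed decimal)
    mirror: int                      # Unichar ID (assumed decimal)
    normed_form: str


def parse_v302_line(line: str) -> UnicharEntry:
    """Parse one eight-field unichar line; reject any other field count."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    char, props, metrics, script, other_case, direction, mirror, normed = fields
    return UnicharEntry(
        character=char,
        properties=int(props, 16),
        glyph_metrics=tuple(int(v) for v in metrics.split(",")),
        script=script,
        other_case=int(other_case),
        direction=int(direction),
        mirror=int(mirror),
        normed_form=normed,
    )


entry = parse_v302_line("N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N")
```

       Here entry.mirror is 1, which is the Unichar ID of "N" itself in the
       example (the second unichar line), consistent with a character that
       has no mirror.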

CAVEATS

       Although the unicharset reader maintains the ability to read
       unicharsets of older formats and will assign default values to missing
       fields, recognition accuracy will be degraded.

       Further, most other data files are indexed by the unicharset file, so
       changing it without re-generating the others is likely to have dire
       consequences.

HISTORY

       The unicharset format first appeared with Tesseract 2.00, which was the
       first version to support languages other than English. The unicharset
       file contained only the first two fields, and the "ispunctuation"
       property was absent (punctuation was regarded as "0", as "=" is in the
       above example).

AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).



                                  03/20/2023                     UNICHARSET(5)