UNICHARSET(5) UNICHARSET(5)

NAME
unicharset - character properties file used by tesseract(1)

DESCRIPTION
Tesseract’s unicharset file contains information on each symbol
(unichar) that the Tesseract OCR engine is trained to recognize.

A unicharset file (e.g. eng.unicharset) is distributed as part of a
Tesseract language pack (e.g. eng.traineddata). For information on
extracting the unicharset file, see combine_tessdata(1).

The first line of a unicharset file contains the number of unichars in
the file. Each subsequent line provides information for a single
unichar. The first such line contains a placeholder reserved for the
space character. Each unichar is referred to within Tesseract by its
Unichar ID, which is the zero-based position of its line among the
unichar lines (that is, the one-based line number within the file
minus 2, since the count occupies the first line). Therefore, space
gets Unichar ID 0.
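
The ID assignment described above can be sketched in Python. This is a minimal illustration, not Tesseract's own reader: the function name unichar_ids is hypothetical, the sample is abbreviated to five unichars whose lines are taken from the v3.02 listing on this page (with the count adjusted to match), and it assumes that no field contains embedded whitespace.

```python
def unichar_ids(text):
    """Return {Unichar ID: symbol} for the text of a unicharset file."""
    lines = text.splitlines()
    count = int(lines[0])              # first line: number of unichars
    entries = lines[1:1 + count]       # then one line per unichar
    # The Unichar ID is the zero-based position among the entry lines.
    return {uid: entry.split()[0] for uid, entry in enumerate(entries)}


# Abbreviated sample (an assumption for illustration): a real file lists
# every unichar, and the count on its first line matches.
sample = """5
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
"""

ids = unichar_ids(sample)
print(ids[0])   # the placeholder reserved for space
print(ids[1])
```

In this sample the mirror field of N, Y, 1 and 9 (the second-to-last field) equals the ID computed here, which is what one would expect for characters that are their own mirror.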

Each unichar line in the unicharset file (v2+) may have four
space-separated fields:

'character' 'properties' 'script' 'id'

Starting with Tesseract v3.02, more information may be given for each
unichar:

'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'

Entries:

character
    The UTF-8 encoded string to be produced for this unichar.

properties
    An integer mask of character properties, one per bit, written in
    hexadecimal. From least to most significant bit, these are:
    isalpha, islower, isupper, isdigit, ispunctuation.

glyph_metrics
    Ten comma-separated integers giving the ranges within which this
    glyph is expected to be found in a baseline-normalized coordinate
    system where 128 is normalized to x-height.

    • min_bottom, max_bottom: the range in which the bottom of the
      character can be found.

    • min_top, max_top: the range in which the top of the character
      can be found.

    • min_width, max_width: the range of the horizontal width of the
      character.

    • min_bearing, max_bearing: how far from the usual start position
      the leftmost part of the character begins.

    • min_advance, max_advance: how far to advance from the left edge
      of the printer’s cell to begin the next character.

script
    Name of the script (Latin, Common, Greek, Cyrillic, Han, null).

other_case
    The Unichar ID of the other-case version of this character (upper
    or lower).

direction
    The Unicode BiDi direction of this character, as defined by ICU’s
    enum UCharDirection. (0 = Left to Right, 1 = Right to Left, 2 =
    European Number...)

mirror
    The Unichar ID of the BiDirectional mirror of this character. For
    example, the mirror of an open parenthesis is a close parenthesis,
    but Latin Capital C has no mirror, so its mirror is itself.

normed_form
    The UTF-8 representation of a "normalized form" of this unichar,
    for the purpose of blaming a module for errors given ground-truth
    text. For instance, a left or right single quote may normalize to
    an ASCII quote.
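
As an illustration of the glyph_metrics entry, the following Python sketch splits the ten integers into five named (min, max) pairs. It assumes the values appear in the file in the same order as the bullets above (bottom, top, width, bearing, advance); the helper name parse_glyph_metrics and the dictionary keys are hypothetical, not part of Tesseract.

```python
# Assumed order of the five (min, max) pairs, following the list above.
METRIC_NAMES = ("bottom", "top", "width", "bearing", "advance")


def parse_glyph_metrics(field):
    """Turn 'a,b,c,...' (ten integers) into {name: (min, max)}."""
    values = [int(v) for v in field.split(",")]
    if len(values) != 10:
        raise ValueError(f"expected 10 integers, got {len(values)}")
    return {
        name: (values[2 * i], values[2 * i + 1])
        for i, name in enumerate(METRIC_NAMES)
    }


# The glyph_metrics field of "N" from the v3.02 listing on this page.
metrics = parse_glyph_metrics("59,68,216,255,87,236,0,27,104,227")
for name, (low, high) in metrics.items():
    print(f"{name}: {low}..{high}")
```

Under that ordering assumption, the top of "N" lies between 216 and 255, well above the value of 128 that the normalization assigns to the x-height, as one would expect for a capital letter.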

EXAMPLE (V2)
; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93

";" is a punctuation character. Its properties are thus represented by
the binary number 10000 (10 in hexadecimal).

"b" is an alphabetic character and a lower case character. Its
properties are thus represented by the binary number 00011 (3 in
hexadecimal).

"W" is an alphabetic character and an upper case character. Its
properties are thus represented by the binary number 00101 (5 in
hexadecimal).

"7" is just a digit. Its properties are thus represented by the binary
number 01000 (8 in hexadecimal).

"=" is neither punctuation, a digit, nor an alphabetic character. Its
properties are thus represented by the binary number 00000 (0 in
hexadecimal).

Japanese or Chinese alphabetic character properties are represented by
the binary number 00001 (1 in hexadecimal): they are alphabetic, but
neither upper nor lower case.
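
The decoding walked through above can be reproduced with a short Python sketch. It assumes only what this page states: the properties field is hexadecimal, and the bits from least to most significant are isalpha, islower, isupper, isdigit, ispunctuation. The helper name decode_properties is hypothetical.

```python
# Bit 0 is the least significant bit of the mask.
FLAGS = ("isalpha", "islower", "isupper", "isdigit", "ispunctuation")


def decode_properties(field):
    """Return the names of the properties set in a hexadecimal mask."""
    mask = int(field, 16)
    return [name for bit, name in enumerate(FLAGS) if mask & (1 << bit)]


# The five symbols and masks from the example above.
for symbol, field in [(";", "10"), ("b", "3"), ("W", "5"), ("7", "8"), ("=", "0")]:
    print(symbol, decode_properties(field))
```

Reading "10" as decimal would set isdigit and islower instead of ispunctuation, which is why the base matters when parsing this field.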

EXAMPLE (V3.02)
110
NULL 0 NULL 0
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
. . .
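
To make the field layout of these lines concrete, here is a small Python sketch that labels each field by name. It is an illustration under stated assumptions, not Tesseract's parser: the function name parse_unichar_line is hypothetical, fields are assumed to contain no embedded whitespace, and only the four-field and eight-field layouts documented on this page are accepted. Values are kept as strings.

```python
V2_FIELDS = ("character", "properties", "script", "id")
V302_FIELDS = (
    "character", "properties", "glyph_metrics", "script",
    "other_case", "direction", "mirror", "normed_form",
)


def parse_unichar_line(line):
    """Split one unichar line into a dict keyed by field name."""
    values = line.split()
    for names in (V302_FIELDS, V2_FIELDS):
        if len(values) == len(names):
            return dict(zip(names, values))
    raise ValueError(f"unexpected field count {len(values)}: {line!r}")


full = parse_unichar_line("1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1")
short = parse_unichar_line("NULL 0 NULL 0")
print(full["script"], full["direction"], full["mirror"])
print(short)
```

For the digit "1" the direction field is 2, which the direction entry above lists as European Number, while the letters in the listing have direction 0 (Left to Right).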

CAVEATS
Although the unicharset reader maintains the ability to read
unicharsets of older formats and will assign default values to missing
fields, recognition accuracy will be degraded.

Further, most other data files are indexed by the unicharset file, so
changing it without regenerating the others is likely to leave them
referring to the wrong unichars.

HISTORY
The unicharset format first appeared with Tesseract 2.00, which was the
first version to support languages other than English. The unicharset
file contained only the first two fields, and the "ispunctuation"
property was absent (punctuation was regarded as "0", as "=" is in the
above example).

SEE ALSO
tesseract(1), combine_tessdata(1), unicharset_extractor(1)

https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html

AUTHOR
The Tesseract OCR engine was written by Ray Smith and his research
groups at Hewlett Packard (1985-1995) and Google (2006-present).

03/20/2023 UNICHARSET(5)