UNICODE_CANONICAL(3)     Courier Unicode Library     UNICODE_CANONICAL(3)

NAME
unicode_canonical, unicode_ccc, unicode_decomposition_init,
unicode_decomposition_deinit, unicode_decompose,
unicode_decompose_reallocate_size, unicode_compose,
unicode_composition_init, unicode_composition_deinit,
unicode_composition_apply - Unicode canonical normalization and
denormalization

SYNOPSIS
#include <courier-unicode.h>

unicode_canonical_t unicode_canonical(char32_t c);

uint8_t unicode_ccc(char32_t c);

void unicode_decomposition_init(unicode_decomposition_t *info,
                                char32_t *string, size_t string_size,
                                void *arg);

int unicode_decompose(unicode_decomposition_t *info);

void unicode_decomposition_deinit(unicode_decomposition_t *info);

size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
                                         const size_t *sizes,
                                         size_t n);

int unicode_compose(char32_t *string, size_t string_size, int flags,
                    size_t *new_size);

int unicode_composition_init(const char32_t *string,
                             size_t string_size, int flags,
                             unicode_composition_t *compositions);

void unicode_composition_deinit(unicode_composition_t *compositions);

size_t unicode_composition_apply(char32_t *string, size_t string_size,
                                 unicode_composition_t *compositions);

DESCRIPTION
These functions compose or decompose a Unicode string into a canonical
or a compatibility normalized form.

unicode_canonical() looks up a character's canonical or compatibility
mapping[1] and returns a structure with the following fields:

canonical_chars
    A pointer to the canonical or equivalent representation of the
    character.

n_canonical_chars
    The number of characters in canonical_chars.

format
    A value of UNICODE_CANONICAL_FMT_NONE indicates a canonical
    mapping; other values indicate a compatibility equivalent mapping.

A NULL canonical_chars (with an n_canonical_chars of 0) indicates that
the character has no canonical or compatibility equivalence.

unicode_ccc() returns the character's canonical combining class value.

unicode_decomposition_init(), unicode_decompose() and
unicode_decomposition_deinit() implement a complete interface for
decomposing a Unicode string:

    unicode_decomposition_t info;

    /* "before" is a malloc(3)-ed, \0-terminated char32_t string. */
    unicode_decomposition_init(&info, before, (size_t)-1, NULL);
    info.decompose_flags=UNICODE_DECOMPOSE_FLAG_QC;

    if (unicode_decompose(&info) == 0)
    {
        /* info.string and info.string_size give the decomposed string. */
    }
    unicode_decomposition_deinit(&info);

unicode_decomposition_init() initializes a new unicode_decomposition_t
structure, which gets passed in as its first parameter. The second
parameter is a pointer to a Unicode string, and the third parameter is
the number of characters in the string. A string size of (size_t)-1
indicates a \0-terminated string, and unicode_decomposition_init()
calculates its string_size (which does not include the trailing \0).
The last parameter is a void *, an opaque pointer that gets stored in
the initialized unicode_decomposition_t object:

    typedef struct unicode_decomposition {
        char32_t *string;
        size_t string_size;
        int decompose_flags;
        int (*reallocate)(
            struct unicode_decomposition *info,
            const size_t *offsets,
            const size_t *sizes,
            size_t n
        );
        void *arg;
    } unicode_decomposition_t;

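The string size convention described above can be sketched as follows.
This is only an illustration, not the library's code: demo_char32 is a
stand-in for char32_t (so that the sketch builds without <uchar.h>), and
demo_resolve_size() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for char32_t; an assumption made for portability of the sketch. */
typedef uint32_t demo_char32;

/*
 * Hypothetical helper: returns string_size unchanged, unless it is
 * (size_t)-1, in which case the characters before the terminating \0
 * get counted. The trailing \0 is not included in the result.
 */
size_t demo_resolve_size(const demo_char32 *string, size_t string_size)
{
    size_t n = 0;

    assert(string != NULL);

    if (string_size != (size_t)-1)
        return string_size;

    while (string[n] != 0)
        ++n;

    return n;
}
```

An explicit size is the only option for a string that is not
\0-terminated.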

unicode_decompose() decomposes the string and replaces it with its
decomposed version.

unicode_decomposition_t's string, string_size and arg are copies of
unicode_decomposition_init()'s parameters. unicode_decomposition_init()
initializes all other fields to their default values.

decompose_flags is a bitmask that gets initialized to 0. The following
flags can be set:

UNICODE_DECOMPOSE_FLAG_QC
    Check each character's appropriate “quick check” property and skip
    decomposing Unicode characters that would get re-composed by
    unicode_composition_apply().

UNICODE_DECOMPOSE_FLAG_COMPAT
    Perform a compatibility decomposition instead of a canonical
    decomposition.

reallocate is a pointer to a function that gets called to reallocate a
larger string. unicode_decompose() determines which characters in the
string need decomposing and calls the reallocate function pointer zero
or more times. Each call to reallocate passes information about where
new characters will get inserted into the string.

reallocate only needs to grow the buffer that string points to, so
that it's big enough to hold a larger, decomposed string, and then
update string accordingly. reallocate should not update string_size or
make any changes to the existing string; that's unicode_decompose()'s
job (after reallocate returns).

The reallocate callback function receives the following parameters:

• A pointer to the unicode_decomposition_t and, notably, its arg.

• A pointer to the array of offset indexes in the string where new
  characters will get inserted in order to hold the decomposed
  string.

• A pointer to the array that holds the number of characters that get
  inserted at each corresponding offset.

• The size of the two arrays.

reallocate must make sure that string can hold at least the initial
string_size characters plus the sum total of all sizes.

unicode_decomposition_init() initializes the reallocate pointer to a
default implementation that uses realloc(3) and updates string with its
return value. The application can install its own reallocate to handle
this task, and use unicode_decompose_reallocate_size() to compute the
minimum string size:

    size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
                                             const size_t *sizes,
                                             size_t n)
    {
        size_t i;
        size_t new_size=info->string_size;

        for (i=0; i<n; ++i)
            new_size += sizes[i];

        return new_size;
    }

The reallocate function returns 0 on success and a non-0 error code to
report a failure, and unicode_decompose() does the same. The only error
condition from unicode_decompose() is a non-0 error code from the
reallocate function. Otherwise, a successful decomposition results in
unicode_decompose() returning 0, with the unicode_decomposition_t's
string pointing to the decomposed string and string_size giving the
number of characters in the decomposed string.

Note
    string_size does not include the trailing \0 character. The input
    string also has its string_size specified without counting its \0
    character. The default implementation of reallocate allocates an
    extra char32_t and sets it to a \0. Therefore:

    • If the Unicode string before decomposition has a trailing \0,
      no decomposition occurs, and no calls to reallocate take
      place: the string in the unicode_decomposition_t is unchanged
      and it's still \0-terminated.

    • The default reallocate allocates an extra char32_t and sets it
      to a \0, so the decomposed string is \0-terminated as well.

    • An application that provides its own replacement reallocate is
      responsible for doing the same, if it wants the decomposed
      string to be \0-terminated.

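A replacement reallocate that follows the rules above might look like
the following sketch. It is written against a local mirror of the
documented structure so that it builds on its own: demo_char32 (a
stand-in for char32_t), demo_decomposition_t, demo_reallocate_size() and
demo_reallocate() are all hypothetical names, not part of the library.
A real callback would take a unicode_decomposition_t * and could call
unicode_decompose_reallocate_size() instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for char32_t; an assumption made for portability of the sketch. */
typedef uint32_t demo_char32;

/* Hypothetical mirror of the documented unicode_decomposition_t fields. */
typedef struct demo_decomposition {
    demo_char32 *string;
    size_t string_size;
    int decompose_flags;
    int (*reallocate)(struct demo_decomposition *info,
                      const size_t *offsets,
                      const size_t *sizes,
                      size_t n);
    void *arg;
} demo_decomposition_t;

/* Same arithmetic as the documented unicode_decompose_reallocate_size(). */
static size_t demo_reallocate_size(const demo_decomposition_t *info,
                                   const size_t *sizes, size_t n)
{
    size_t new_size = info->string_size;
    size_t i;

    for (i = 0; i < n; ++i)
        new_size += sizes[i];

    return new_size;
}

/*
 * Grows the buffer and updates info->string. string_size and the
 * existing characters are left alone, as the text above requires.
 */
int demo_reallocate(demo_decomposition_t *info,
                    const size_t *offsets,
                    const size_t *sizes,
                    size_t n)
{
    size_t new_size = demo_reallocate_size(info, sizes, n);
    demo_char32 *p;

    (void)offsets; /* not needed just to grow the buffer */
    assert(new_size >= info->string_size);

    /* One extra element, so the decomposed string can be \0-terminated. */
    p = realloc(info->string, (new_size + 1) * sizeof(*p));
    if (!p)
        return -1; /* non-0 reports the failure; the old buffer is intact */

    p[new_size] = 0;
    info->string = p;
    return 0;
}
```

Writing the \0 at index new_size touches only the newly allocated
element, so the characters of the existing string stay as they were.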
Note
    Multiple calls to the reallocate callback are possible. Each call
    to reallocate reflects the prior calls' decompositions. Example:
    the original string has five characters, and the first call to
    reallocate has two offsets, at positions 1 and 3, with a value of 1
    for both of their sizes. This has the effect of transforming an
    original Unicode string "AAAAA" into "AXAAXAA" (with “A”
    representing unspecified characters in the original string, and
    “X” showing the two characters added in the first call to
    reallocate).

    A second call to reallocate with an offset at position 4, and a
    size of 1, results in the updated string "AXAAYXAA" (with “Y”
    marking an unspecified character inserted by the second call).

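The bookkeeping in the example above can be reproduced with a small
stand-alone sketch. It only models where the new positions open up, and
it uses plain char strings for readability; demo_insert_placeholders()
is a hypothetical helper, not a library function, and it assumes that
offsets are given in ascending order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a malloc(3)-ed copy of src with sizes[i]
 * copies of fill inserted in front of the character at offsets[i].
 * Offsets refer to positions in src and must be in ascending order.
 */
char *demo_insert_placeholders(const char *src,
                               const size_t *offsets,
                               const size_t *sizes,
                               size_t n, char fill)
{
    size_t len, new_len, i, j, k = 0, out = 0;
    char *dst;

    assert(src != NULL);
    assert(n == 0 || (offsets != NULL && sizes != NULL));

    len = strlen(src);
    new_len = len;
    for (i = 0; i < n; ++i)
        new_len += sizes[i];

    dst = malloc(new_len + 1);
    if (!dst)
        return NULL;

    for (i = 0; i <= len; ++i) {
        /* Emit the placeholders scheduled at this position, if any. */
        while (k < n && offsets[k] == i) {
            for (j = 0; j < sizes[k]; ++j)
                dst[out++] = fill;
            ++k;
        }
        if (i < len)
            dst[out++] = src[i];
    }
    dst[out] = '\0';
    return dst;
}
```

Calling it on "AAAAA" with offsets {1, 3}, sizes {1, 1} and a fill of
'X', and then on the result with offset {4}, size {1} and a fill of
'Y', walks through the two calls described in the note.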
Note
    Unicode string decomposition involves replacing a given Unicode
    character with one or more other characters. The sizes given to
    reallocate reflect the net addition to the Unicode string. For
    example: decomposing one Unicode character into three decomposed
    characters results in a call to reallocate reporting an insert of
    two more characters.

Note
    offsets actually report the indices of each Unicode character
    that's getting decomposed. A 1:1 decomposition of a Unicode
    character gets reported as an additional sizes entry of 0.

unicode_decomposition_deinit() releases all resources and destroys the
unicode_decomposition_t, which is no longer valid afterwards.

Note
    unicode_decomposition_deinit() does not free(3) the string. The
    original string gets passed in to unicode_decomposition_init() and
    the decomposed string is left in string.

    The default implementation of the reallocate function assumes that
    string is a malloc(3)-ed string, and reallocs it.

Note
    At this time unicode_decomposition_deinit() does nothing. All code
    should explicitly call it in order to remain forward-compatible (at
    the source level).

unicode_compose() performs a canonical composition of a decomposed
string. Its parameters are:

• A pointer to the decomposed Unicode string.

• The number of characters in the Unicode string. The Unicode string
  does not need to be \0-terminated; if it is, this number does not
  include the trailing \0.

• A flags bitmask, which can combine the following flags:

  UNICODE_COMPOSE_FLAG_REMOVEUNUSED
      Remove all combining marks after doing all canonical
      compositions. Normally any unused combining marks are left in
      place, in the combined text. This option removes them.

  UNICODE_COMPOSE_FLAG_ONESHOT
      Perform canonical composition once per character, and do not
      attempt to combine any resulting combined characters again.

• A non-NULL pointer to a size_t.

  On success unicode_compose() returns 0 and sets this size_t to the
  number of characters in the combined string. The string gets
  combined in place: the combined string replaces the contents of the
  string parameter.

unicode_compose() returns a non-zero value to indicate an error.

unicode_composition_init(), unicode_composition_apply() and
unicode_composition_deinit() implement a detailed interface for
canonical composition of a decomposed Unicode string:

    unicode_composition_t compositions;

    if (unicode_composition_init(str, strsize, flags, &compositions) == 0)
    {
        /* Compose str in place; new_size is the composed string's size. */
        size_t new_size=unicode_composition_apply(str, strsize, &compositions);

        unicode_composition_deinit(&compositions);
    }

The first two parameters to both unicode_composition_init() and
unicode_composition_apply() are the same: the Unicode string and the
number of characters (not including any trailing \0 character) in the
Unicode string.

unicode_composition_init()'s additional parameters are: any optional
flags (see unicode_compose() for a list of available flags), and the
address of a unicode_composition_t object. A non-0 return from
unicode_composition_init() indicates an error.
unicode_composition_init() indicates success by returning 0 and
initializing the unicode_composition_t object, which contains a
pointer to an array of pointers to unicode_compose_info objects, and
the number of pointers. unicode_composition_init() does not change the
string; the only thing it does is initialize the unicode_composition_t
object.

unicode_composition_apply() applies the compositions to the string, in
place, and returns the new size of the string. The returned size does
not include any trailing \0 character, but unicode_composition_apply()
does append one if the composed string is smaller, so the composed
string is \0-terminated if the decomposed string was.

It is necessary to call unicode_composition_deinit() to free all memory
that was allocated for the unicode_composition_t object:

    struct unicode_compose_info {
        size_t index;
        size_t n_composed;
        char32_t *composition;
        size_t n_composition;
    };

    typedef struct {
        struct unicode_compose_info **compositions;
        size_t n_compositions;
    } unicode_composition_t;

index gives the character index in the string where each composition
occurs. n_composed gives the number of characters in the original
string that get composed. composition points to the composed
characters, and n_composition gives their number.

Effectively: at the index position in the original string, n_composed
characters get removed and n_composition characters replace them
(always n_composed or less).

Note
    The UNICODE_COMPOSE_FLAG_REMOVEUNUSED flag has the effect of
    including the combining marks that did not get combined in the
    n_composed count. It's possible that, in this case, n_composition
    is 0. This indicates complete removal of the combining marks,
    without anything getting combined in their place.

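How such a list of compositions gets applied to a string can be
modelled with the following sketch. It is an illustration of the
description above, not the library's implementation: demo_char32 (a
stand-in for char32_t), struct demo_compose_info and
demo_composition_apply() are hypothetical names that mirror the
documented fields, and the entries are assumed to be sorted by index
and not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for char32_t; an assumption made for portability of the sketch. */
typedef uint32_t demo_char32;

/* Hypothetical mirror of the documented struct unicode_compose_info. */
struct demo_compose_info {
    size_t index;
    size_t n_composed;
    demo_char32 *composition;
    size_t n_composition;
};

/*
 * Applies the compositions in place and returns the new size. Each entry
 * removes n_composed characters at index and puts n_composition
 * characters in their place. If the string shrinks, a \0 gets appended.
 */
size_t demo_composition_apply(demo_char32 *string, size_t string_size,
                              struct demo_compose_info **compositions,
                              size_t n_compositions)
{
    size_t in = 0, out = 0, i, j;

    for (i = 0; i < n_compositions; ++i) {
        const struct demo_compose_info *c = compositions[i];

        assert(c->index >= in);
        assert(c->index + c->n_composed <= string_size);
        assert(c->n_composition <= c->n_composed);

        /* Keep the characters in front of this composition unchanged. */
        while (in < c->index)
            string[out++] = string[in++];

        /* Write the composed characters, skip the ones they replace. */
        for (j = 0; j < c->n_composition; ++j)
            string[out++] = c->composition[j];
        in += c->n_composed;
    }

    /* Keep whatever follows the last composition. */
    while (in < string_size)
        string[out++] = string[in++];

    if (out < string_size)
        string[out] = 0;

    return out;
}
```

Working in place is safe here because the write position never gets
ahead of the read position, which follows from n_composition never
exceeding n_composed. An entry with an n_composition of 0 simply drops
n_composed characters.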
unicode_composition_init() sets unicode_composition_t's compositions
pointer to an array of pointers to unicode_compose_info objects that
are sorted according to their index. n_compositions gives the number of
pointers in the array. It is 0 if there are no compositions and the
array is empty. An empty array gets interpreted accordingly when it
gets passed to unicode_composition_apply() and
unicode_composition_deinit(): nothing happens.
unicode_composition_apply() simply returns the size of the unchanged
string, and unicode_composition_deinit() does a pro-forma cleanup.

SEE ALSO
TR-15[2], courier-unicode(7), unicode::canonical(3).

AUTHOR
Sam Varshavchik
    Author

NOTES
1. canonical and compatibility mapping
   https://www.unicode.org/reports/tr15/tr15-50.html

2. TR-15
   https://www.unicode.org/reports/tr15/tr15-50.html

Courier Unicode Library          04/16/2022          UNICODE_CANONICAL(3)