unicode_decomposition_init(3)

UNICODE_CANONICAL(3)        Courier Unicode Library       UNICODE_CANONICAL(3)

NAME

       unicode_canonical, unicode_ccc, unicode_decomposition_init,
       unicode_decomposition_deinit, unicode_decompose,
       unicode_decompose_reallocate_size, unicode_compose,
       unicode_composition_init, unicode_composition_deinit,
       unicode_composition_apply - unicode canonical normalization and
       denormalization

SYNOPSIS

       #include <courier-unicode.h>

       unicode_canonical_t unicode_canonical(char32_t c);

       uint8_t unicode_ccc(char32_t c);

       void unicode_decomposition_init(unicode_decomposition_t *info,
                                       char32_t *string, size_t string_size,
                                       void *arg);

       int unicode_decompose(unicode_decomposition_t *info);

       void unicode_decomposition_deinit(unicode_decomposition_t *info);

       size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
                                                const size_t *sizes,
                                                size_t n);

       int unicode_compose(char32_t *string, size_t string_size, int flags,
                           size_t *new_size);

       int unicode_composition_init(const char32_t *string,
                                    size_t string_size, int flags,
                                    unicode_composition_t *compositions);

       void unicode_composition_deinit(unicode_composition_t *compositions);

       size_t unicode_composition_apply(char32_t *string, size_t string_size,
                                        unicode_composition_t *compositions);

DESCRIPTION

       These functions compose or decompose a Unicode string into a canonical
       or a compatibility normalized form.

       unicode_canonical() looks up the character's canonical and
       compatibility mapping[1].  unicode_canonical() returns a structure with
       the following fields:

       canonical_chars
           A pointer to the canonical or compatibility equivalent
           representation of the character.

       n_canonical_chars
           Number of characters in canonical_chars.

       format
           A value of UNICODE_CANONICAL_FMT_NONE indicates a canonical
           mapping; other values indicate a compatibility equivalent mapping.

       A NULL canonical_chars (with a 0 n_canonical_chars) indicates that the
       character has no canonical or compatibility equivalence.

       unicode_ccc() returns the character's canonical combining class value.

       unicode_decomposition_init(), unicode_decompose() and
       unicode_decomposition_deinit() implement a complete interface for
       decomposing a Unicode string:

           unicode_decomposition_t info;

           unicode_decomposition_init(&info, before, (size_t)-1, NULL);
           info.decompose_flags=UNICODE_DECOMPOSE_FLAG_QC;
           unicode_decompose(&info);
           unicode_decomposition_deinit(&info);

       unicode_decomposition_init() initializes a new unicode_decomposition_t
       structure that gets passed in as its first parameter. The second
       parameter is a pointer to a Unicode string, with the number of
       characters in the string in the third parameter. A string size of -1
       indicates a \0-terminated string, and unicode_decomposition_init()
       calculates its string_size (which does not include the trailing \0).
       The last parameter is a void *, an opaque pointer that gets stored in
       the initialized unicode_decomposition_t object:

       typedef struct unicode_decomposition {
           char32_t   *string;
           size_t     string_size;
           int        decompose_flags;
           int        (*reallocate)(
                           struct unicode_decomposition   *info,
                           const size_t                   *offsets,
                           const size_t                   *sizes,
                           size_t                         n
                      );
           void       *arg;
       } unicode_decomposition_t;

       unicode_decompose() decomposes the string and replaces it with its
       decomposed version.

       unicode_decomposition_t's string, string_size and arg are copies of
       unicode_decomposition_init()'s parameters.  unicode_decomposition_init()
       initializes all other fields to their default values.

       decompose_flags gets initialized to 0. It is a bitmask of the
       following flags:

       UNICODE_DECOMPOSE_FLAG_QC
           Check each character's appropriate “quick check” property and skip
           decomposing Unicode characters that would get re-composed by
           unicode_composition_apply().

       UNICODE_DECOMPOSE_FLAG_COMPAT
           Perform a compatibility decomposition instead of a canonical
           decomposition.

       reallocate is a pointer to a function that gets called to reallocate a
       larger string.  unicode_decompose() determines which characters in the
       string need decomposing and calls the reallocate function pointer zero
       or more times. Each call to reallocate passes information about where
       new characters will get inserted into the string.

       reallocate only needs to grow the buffer that string points to, so
       that it's big enough to hold a larger, decomposed string, and then
       update string accordingly.  reallocate should not update string_size
       or make any changes to the existing string; that's
       unicode_decompose()'s job (after reallocate returns).

       The reallocate callback function receives the following parameters:

       •   A pointer to the unicode_decomposition_t and, notably, its arg.

       •   A pointer to the array of offset indexes in the string where new
           characters will get inserted in order to hold the decomposed
           string.

       •   A pointer to the array that holds the number of characters that get
           inserted at each corresponding offset.

       •   The size of the two arrays.

       reallocate must update the string, if necessary, to hold at least as
       many characters as the initial string_size plus the sum of all sizes.

       unicode_decomposition_init() initializes the reallocate pointer to a
       default implementation that uses realloc(3) and updates string with its
       return value. The application can install its own reallocate to handle
       this task, and use unicode_decompose_reallocate_size() to compute the
       minimum string size:

           size_t unicode_decompose_reallocate_size(unicode_decomposition_t *info,
                                                    const size_t *sizes,
                                                    size_t n)
           {
               size_t i;
               size_t new_size=info->string_size;

               for (i=0; i<n; ++i)
                   new_size += sizes[i];

               return new_size;
           }

       The reallocate function returns 0 on success and a non-0 error code to
       report a failure, and unicode_decompose() does the same. The only error
       condition from unicode_decompose() is a non-0 error code from the
       reallocate function. Otherwise, a successful decomposition results in
       unicode_decompose() returning 0, with unicode_decomposition_t's string
       pointing to the decomposed string and string_size giving the number of
       characters in the decomposed string.

           Note
           string_size does not include the trailing \0 character. The input
           string also has its string_size specified without counting its \0
           character. The default implementation of reallocate allocates an
           extra char32_t and sets it to a \0. Therefore:

           •   If the Unicode string before decomposition has a trailing \0
               and no decomposition occurs, no calls to reallocate take
               place: the string in the unicode_decomposition_t is unchanged
               and it's still \0-terminated.

           •   The default reallocate allocates an extra char32_t and sets it
               to a \0, so it takes care of terminating the decomposed
               string.

           •   An application that provides its own replacement reallocate is
               responsible for doing the same, if it wants the decomposed
               string to be \0-terminated.

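       The replacement reallocate described above might look roughly like
       the following sketch. It is not the library's own code: struct
       decomposition_sketch is a hypothetical stand-in that mirrors the
       documented unicode_decomposition_t fields so that the example builds
       without courier-unicode.h, and sketch_reallocate_size() and
       sketch_reallocate() are illustrative names. The callback grows the
       buffer, reserves one extra char32_t for the trailing \0, and leaves
       string_size and the existing characters alone.

```c
#include <assert.h>
#include <stdlib.h>
#include <uchar.h>

/* Hypothetical stand-in mirroring the documented unicode_decomposition_t. */
struct decomposition_sketch {
    char32_t *string;
    size_t    string_size;
    int       decompose_flags;
    int     (*reallocate)(struct decomposition_sketch *info,
                          const size_t *offsets,
                          const size_t *sizes,
                          size_t n);
    void     *arg;
};

/* Same arithmetic as unicode_decompose_reallocate_size(): old size plus
 * the sum of all reported insertions. */
static size_t sketch_reallocate_size(const struct decomposition_sketch *info,
                                     const size_t *sizes, size_t n)
{
    size_t new_size = info->string_size;

    for (size_t i = 0; i < n; ++i)
        new_size += sizes[i];
    return new_size;
}

/* Replacement callback: grow the malloc(3)-ed buffer and terminate it.
 * Returns 0 on success, non-0 on failure. */
static int sketch_reallocate(struct decomposition_sketch *info,
                             const size_t *offsets,
                             const size_t *sizes,
                             size_t n)
{
    (void)offsets; /* only the sizes matter for the allocation */
    assert(info != NULL);

    size_t new_size = sketch_reallocate_size(info, sizes, n);
    char32_t *p = realloc(info->string, (new_size + 1) * sizeof(char32_t));

    if (p == NULL)
        return -1;       /* the original buffer is still valid */

    p[new_size] = 0;     /* trailing \0 past the future end of the string */
    info->string = p;    /* string_size is deliberately left untouched */
    return 0;
}
```

       With the real library, an equivalent callback would be assigned to
       the reallocate field after unicode_decomposition_init() returns and
       before unicode_decompose() gets called.
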
           Note
           Multiple calls to the reallocate callback are possible. Each call
           to reallocate reflects the prior calls' decompositions. Example: the
           original string has five characters and the first call to
           reallocate has two offsets, at positions 1 and 3, each with a size
           of 1. This has the effect of transforming an original Unicode
           string "AAAAA" into "AXAAXAA" (with “A” representing unspecified
           characters in the original string, and “X” showing the two
           characters added in the first call to reallocate).

           A second call to reallocate with an offset at position 4, and a
           size of 1, results in the updated string of "AXAAYXAA" (with “Y”
           marking an unspecified character inserted by the second call).

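       The offset bookkeeping in the example above can be reproduced with a
       small stand-alone sketch. insert_gaps() is a hypothetical helper, not
       a library function. It assumes, based on that example, that each
       offset refers to a position in the string as it was before the call,
       that offsets are in ascending order, and that the new characters
       land in front of the character at that offset. Plain char is used
       instead of char32_t to keep the strings readable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Insert sizes[i] copies of fill in front of the character at offsets[i].
 * buf must have room for len + sum(sizes) + 1 characters.
 * Returns the new length, not counting the trailing \0. */
static size_t insert_gaps(char *buf, size_t len,
                          const size_t *offsets, const size_t *sizes,
                          size_t n, char fill)
{
    size_t total = 0;

    for (size_t i = 0; i < n; ++i)
        total += sizes[i];

    size_t new_len = len + total;
    size_t src = len;      /* one past the last character not yet moved */
    size_t dst = new_len;  /* one past the last slot not yet written */

    buf[new_len] = '\0';

    /* Work backwards so that nothing gets overwritten before it moves. */
    for (size_t i = n; i-- > 0; ) {
        assert(offsets[i] <= src);

        size_t tail = src - offsets[i];

        memmove(buf + dst - tail, buf + offsets[i], tail);
        dst -= tail;
        src = offsets[i];

        memset(buf + dst - sizes[i], fill, sizes[i]);
        dst -= sizes[i];
    }
    return new_len;
}
```

       Starting from "AAAAA", offsets 1 and 3 with sizes of 1 and a fill of
       'X' give "AXAAXAA"; a second call on that result with offset 4, size
       1 and a fill of 'Y' gives "AXAAYXAA", matching the note.
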
           Note
           Unicode string decomposition involves replacing a given Unicode
           character with one or more other characters. The sizes given to
           reallocate reflect the net addition to the Unicode string. For
           example: decomposing one Unicode character into three decomposed
           characters results in a call to reallocate reporting an insert of
           two more characters.

           Note
           offsets actually report the indices of each Unicode character
           that's getting decomposed. A 1:1 decomposition of a Unicode
           character gets reported as an additional sizes entry of 0.

       unicode_decomposition_deinit() releases all resources and destroys the
       unicode_decomposition_t; it is no longer valid.

           Note
           unicode_decomposition_deinit() does not free(3) the string. The
           original string gets passed in to unicode_decomposition_init() and
           the decomposed string is left in the string.

       The default implementation of the reallocate function assumes the
       string is a malloc(3)-ed string, and reallocs it.

           Note
           At this time unicode_decomposition_deinit() does nothing. All code
           should explicitly call it in order to remain forward-compatible (at
           the source level).

       unicode_compose() performs a canonical composition of a decomposed
       string. Its parameters are:

       •   A pointer to the decomposed Unicode string.

       •   The number of characters in the Unicode string. The Unicode string
           does not need to be \0-terminated; if it is, this number does not
           include the \0.

       •   A flags bitmask, which can have the following values:

           UNICODE_COMPOSE_FLAG_REMOVEUNUSED
               Remove all combining marks after doing all canonical
               compositions. Normally any unused combining marks are left in
               place, in the combined text. This option removes them.

           UNICODE_COMPOSE_FLAG_ONESHOT
               Perform canonical composition once per character, and do not
               attempt to combine any resulting combined characters again.

       •   A non-NULL pointer to a size_t.

           A successful composition sets this size_t to the number of
           characters in the combined string, and returns 0. The string gets
           combined in place: the combined string is placed back into the
           string parameter, and the size_t gives its new size.

           unicode_compose() returns a non-zero value to indicate an error.

       unicode_composition_init(), unicode_composition_apply() and
       unicode_composition_deinit() implement a detailed interface for
       canonical composition of a decomposed Unicode string:

           unicode_composition_t compositions;

           if (unicode_composition_init(str, strsize, flags, &compositions) == 0)
           {
               size_t new_size=unicode_composition_apply(str, strsize, &compositions);

               unicode_composition_deinit(&compositions);
           }

       The first two parameters to both unicode_composition_init() and
       unicode_composition_apply() are the same: the Unicode string and the
       number of characters (not including any trailing \0 character) in the
       Unicode string.

       unicode_composition_init()'s additional parameters are: any optional
       flags (see unicode_compose() for a list of available flags), and the
       address of a unicode_composition_t object. A non-0 return from
       unicode_composition_init() indicates an error.
       unicode_composition_init() indicates success by returning 0 and
       initializing the unicode_composition_t object, which contains a
       pointer to an array of pointers to unicode_compose_info objects, and
       the number of pointers.  unicode_composition_init() does not change the
       string; the only thing it does is initialize the unicode_composition_t
       object.

       unicode_composition_apply() applies the compositions to the string, in
       place, and returns the new size of the string (also not including the
       \0 character; however, it does append one if the composed string is
       smaller, so the composed string is \0-terminated if the decomposed
       string was).

       It is necessary to call unicode_composition_deinit() to free all memory
       that was allocated for the unicode_composition_t object:

       struct unicode_compose_info {
           size_t                        index;
           size_t                        n_composed;
           char32_t                      *composition;
           size_t                        n_composition;
       };

       typedef struct {
           struct unicode_compose_info   **compositions;
           size_t                        n_compositions;
       } unicode_composition_t;

       index gives the character index in the string where each composition
       occurs.  n_composed gives the number of characters in the original
       string that get composed. The composed characters are the composition,
       and n_composition gives the number of composed characters.

       Effectively: at the index position in the original string, n_composed
       characters get removed and n_composition characters replace them
       (n_composition is always n_composed or less).

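       The replacement rule above can be illustrated with a stand-alone
       sketch. This is not the library's implementation of
       unicode_composition_apply(): struct compose_info_sketch is a
       hypothetical stand-in that mirrors the documented
       unicode_compose_info fields, and apply_sketch() is an illustrative
       name. It assumes the records are sorted by index and do not overlap,
       and it appends a \0 only when the result is shorter than the input.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-in mirroring the documented unicode_compose_info. */
struct compose_info_sketch {
    size_t    index;         /* where the composition occurs */
    size_t    n_composed;    /* characters removed at index */
    char32_t *composition;   /* characters that replace them */
    size_t    n_composition; /* never more than n_composed */
};

/* Apply sorted, non-overlapping records in place; return the new size. */
static size_t apply_sketch(char32_t *string, size_t string_size,
                           struct compose_info_sketch **compositions,
                           size_t n_compositions)
{
    size_t src = 0; /* next character to read */
    size_t dst = 0; /* next slot to write, never ahead of src */

    for (size_t i = 0; i < n_compositions; ++i) {
        const struct compose_info_sketch *c = compositions[i];

        assert(c->index >= src);
        assert(c->n_composition <= c->n_composed);
        assert(c->index + c->n_composed <= string_size);

        /* Copy the untouched characters in front of this record. */
        while (src < c->index)
            string[dst++] = string[src++];

        /* Write the replacement and skip what it replaces. */
        for (size_t j = 0; j < c->n_composition; ++j)
            string[dst++] = c->composition[j];
        src += c->n_composed;
    }

    /* Copy whatever follows the last record. */
    while (src < string_size)
        string[dst++] = string[src++];

    if (dst < string_size)
        string[dst] = 0; /* keep a terminated string terminated */
    return dst;
}
```

       For example, a single record with an index of 3, an n_composed of 2
       and a one-character composition of U+00E9 turns the five characters
       "c", "a", "f", "e", U+0301 into the four characters "c", "a", "f",
       U+00E9.
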
           Note
           The UNICODE_COMPOSE_FLAG_REMOVEUNUSED flag has the effect of
           including the combining marks that did not get combined in the
           n_composed count. It's possible that, in this case, n_composition
           is 0. This indicates complete removal of the combining marks,
           without anything getting combined in their place.

       unicode_composition_init() sets unicode_composition_t's compositions
       pointer to an array of pointers to unicode_compose_info objects that
       are sorted according to their index.  n_compositions gives the number
       of pointers in the array, and is 0 if there are no compositions (the
       array is empty). The empty array gets interpreted accordingly when it
       gets passed to unicode_composition_apply() and
       unicode_composition_deinit(): nothing happens.
       unicode_composition_apply() simply returns the size of the unchanged
       string, and unicode_composition_deinit() does a pro-forma cleanup.

AUTHOR

       Sam Varshavchik
           Author

NOTES

        1. canonical and compatibility mapping
           https://www.unicode.org/reports/tr15/tr15-50.html

        2. TR-15
           https://www.unicode.org/reports/tr15/tr15-50.html

Courier Unicode Library           05/31/2022              UNICODE_CANONICAL(3)
