unicode_wbscan_init(3)

UNICODE_WORD_BREAK(3)       Courier Unicode Library      UNICODE_WORD_BREAK(3)

NAME

       unicode_wb_init, unicode_wb_next, unicode_wb_next_cnt, unicode_wb_end,
       unicode_wbscan_init, unicode_wbscan_next, unicode_wbscan_end,
       unicode_word_break - calculate word breaks

SYNOPSIS

       #include <courier-unicode.h>

       unicode_wb_info_t unicode_wb_init(int (*cb_func)(int, void *),
                                         void *cb_arg);

       int unicode_wb_next(unicode_wb_info_t wb, char32_t c);

       int unicode_wb_next_cnt(unicode_wb_info_t wb, const char32_t *cptr,
                               size_t cnt);

       int unicode_wb_end(unicode_wb_info_t wb);

       unicode_wbscan_info_t unicode_wbscan_init(void);

       int unicode_wbscan_next(unicode_wbscan_info_t wbs, char32_t c);

       size_t unicode_wbscan_end(unicode_wbscan_info_t wbs);

DESCRIPTION

       These functions implement the Unicode word breaking algorithm
       described in TR-29 (see NOTES). Invoke unicode_wb_init() to initialize
       the word breaking algorithm. The first parameter is a callback
       function. The second parameter is an opaque pointer. The callback
       function gets invoked with two parameters: a word break flag,
       described below, and the opaque pointer that was given to
       unicode_wb_init(). These functions do not interpret the opaque pointer
       in any way.

       unicode_wb_init() returns an opaque handle. Repeated invocations of
       unicode_wb_next(), each passing the handle and one Unicode character,
       define the sequence of Unicode characters over which the word breaking
       calculation takes place. unicode_wb_next_cnt() is a shortcut for
       invoking unicode_wb_next() repeatedly over an array cptr containing
       cnt Unicode characters.

       unicode_wb_end() denotes the end of the Unicode character sequence.
       After the call to unicode_wb_end() the unicode_wb_info_t handle is no
       longer valid.

       Between the call to unicode_wb_init() and unicode_wb_end(), the
       callback function gets invoked exactly once for each Unicode character
       given to unicode_wb_next() or unicode_wb_next_cnt(). Usually each call
       to unicode_wb_next() invokes the callback function immediately, but
       this is not guaranteed. A call to unicode_wb_next() may return without
       invoking the callback function, and a subsequent call to
       unicode_wb_next() (or unicode_wb_end()) may then invoke the callback
       function more than once, to catch up. The contract is that, unless an
       error occurs, the callback function has been invoked by the time
       unicode_wb_end() returns exactly as many times as there are characters
       in the Unicode sequence defined by the intervening calls to
       unicode_wb_next() and unicode_wb_next_cnt().

       Each call to the callback function reports the calculated word
       breaking status of the corresponding character in the Unicode
       character sequence. If the first parameter to the callback function is
       non-zero, a word break is permitted before the corresponding
       character. A zero value indicates that a word break is prohibited
       before the corresponding character.
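
       As an illustration of the callback contract described above, the
       following minimal sketch records the flag passed on each invocation.
       The names struct wb_recorder, record_break() and word_starts() are
       hypothetical and not part of the library. The library calls appear
       only in a comment, so the block compiles without the library
       installed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical recorder; the library only sees it as the opaque cb_arg. */
struct wb_recorder {
    unsigned char *flags;   /* flags[i] != 0: break permitted before char i */
    size_t capacity;        /* number of slots available in flags */
    size_t count;           /* callback invocations seen so far */
};

/* Matches the int (*cb_func)(int, void *) signature from the SYNOPSIS. */
int record_break(int can_break, void *cb_arg)
{
    struct wb_recorder *rec = cb_arg;

    assert(rec != NULL);
    if (rec->count >= rec->capacity)
        return -1;          /* non-zero: report an error to the caller */
    rec->flags[rec->count++] = (unsigned char)(can_break != 0);
    return 0;
}

/*
 * Copies the index of every character that may start a new word into
 * starts[], up to max_starts entries, and returns how many were written.
 */
size_t word_starts(const struct wb_recorder *rec,
                   size_t *starts, size_t max_starts)
{
    size_t n = 0;

    for (size_t i = 0; i < rec->count && n < max_starts; i++)
        if (rec->flags[i])
            starts[n++] = i;
    return n;
}

/*
 * Intended use with the library (not compiled here):
 *
 *   unicode_wb_info_t wb = unicode_wb_init(record_break, &rec);
 *   int rc = unicode_wb_next_cnt(wb, text, text_len);
 *   int rc_end = unicode_wb_end(wb);
 *   // if both calls returned 0, rec.count == text_len is expected
 */
```

       Comparing rec.count with the number of characters supplied is one way
       to confirm, after unicode_wb_end() returns, that every character was
       reported.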

       The callback function should return 0. A non-zero value indicates to
       the word breaking algorithm that an error has occurred.
       unicode_wb_next() and unicode_wb_next_cnt() return zero either if they
       never invoked the callback function, or if each call to the callback
       function returned zero. A non-zero return from the callback function
       results in unicode_wb_next() and unicode_wb_next_cnt() immediately
       returning the same value.

       unicode_wb_end() must be invoked to destroy the word breaking handle
       even if unicode_wb_next() or unicode_wb_next_cnt() returned an error
       indication. Under normal circumstances, unicode_wb_end() may itself
       invoke the callback function one or more times. The return value from
       unicode_wb_end() has the same meaning as from unicode_wb_next() and
       unicode_wb_next_cnt(); however, in all cases the word breaking handle
       is no longer valid after unicode_wb_end() returns.
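
       The error handling rules above could also be used to stop processing
       early. The following is a minimal sketch, assuming a hypothetical
       struct wb_limit, a hypothetical stop_after_limit() callback and a
       hypothetical WB_STOP code, none of which are part of the library. The
       library calls appear only in a comment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical non-zero code; any non-zero value is treated as an error. */
#define WB_STOP 1

/* Hypothetical state passed to the library as the opaque cb_arg. */
struct wb_limit {
    size_t breaks_seen;     /* permitted word breaks reported so far */
    size_t limit;           /* stop once more than this many were seen */
};

/* Returns 0 to continue, WB_STOP once the limit has been exceeded. */
int stop_after_limit(int can_break, void *cb_arg)
{
    struct wb_limit *st = cb_arg;

    assert(st != NULL);
    if (can_break && ++st->breaks_seen > st->limit)
        return WB_STOP;
    return 0;
}

/*
 * Intended cleanup pattern with the library (not compiled here):
 *
 *   int rc = unicode_wb_next_cnt(wb, text, text_len);
 *   int rc_end = unicode_wb_end(wb);   // always called, even if rc != 0
 *   if (rc == 0)
 *       rc = rc_end;
 *   // rc == WB_STOP means the callback asked to stop, not a failure
 */
```

       Using a distinct non-zero code lets the caller tell a deliberate stop
       apart from other errors, since the callback's return value is passed
       back unchanged.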

   Word scan
       unicode_wbscan_init(), unicode_wbscan_next() and unicode_wbscan_end()
       scan for the next word boundary in a Unicode character sequence.
       unicode_wbscan_init() obtains a handle, then unicode_wbscan_next() gets
       invoked repeatedly to define the Unicode character sequence.
       unicode_wbscan_end() deallocates the handle and returns the number of
       leading characters in the Unicode character sequence up to the first
       word break.

       A non-zero return value from unicode_wbscan_next() indicates that the
       word boundary is already known, and any further calls to
       unicode_wbscan_next() will be ignored. unicode_wbscan_end() must still
       be called, to deallocate the handle and to obtain the Unicode
       character count.

AUTHOR

       Sam Varshavchik
           Author

NOTES

        1. TR-29
           https://www.unicode.org/reports/tr29/tr29-37.html

Courier Unicode Library           05/31/2022             UNICODE_WORD_BREAK(3)
