UNICODE_WORD_BREAK(3)    Courier Unicode Library    UNICODE_WORD_BREAK(3)

NAME
unicode_wb_init, unicode_wb_next, unicode_wb_next_cnt, unicode_wb_end,
unicode_wbscan_init, unicode_wbscan_next, unicode_wbscan_end -
calculate word breaks

SYNOPSIS
#include <courier-unicode.h>

unicode_wb_info_t unicode_wb_init(int (*cb_func)(int, void *),
                                  void *cb_arg);

int unicode_wb_next(unicode_wb_info_t wb, char32_t c);

int unicode_wb_next_cnt(unicode_wb_info_t wb, const char32_t *cptr,
                        size_t cnt);

int unicode_wb_end(unicode_wb_info_t wb);

unicode_wbscan_info_t unicode_wbscan_init(void);

int unicode_wbscan_next(unicode_wbscan_info_t wbs, char32_t c);

size_t unicode_wbscan_end(unicode_wbscan_info_t wbs);

DESCRIPTION
These functions implement the Unicode word breaking algorithm. Invoke
unicode_wb_init() to initialize the word breaking algorithm. The first
parameter is a callback function. The second parameter is an opaque
pointer. The callback function gets invoked with two parameters. The
first parameter reports the word breaking status of one character. The
second parameter is the opaque pointer that was given to
unicode_wb_init(); these functions pass it through without interpreting
it in any way.

unicode_wb_init() returns an opaque handle. Repeated invocations of
unicode_wb_next(), each passing the handle and one Unicode character,
define the sequence of Unicode characters over which the word breaking
calculation takes place. unicode_wb_next_cnt() is a shortcut for
invoking unicode_wb_next() repeatedly over an array cptr containing
cnt Unicode characters.

unicode_wb_end() denotes the end of the Unicode character sequence.
After the call to unicode_wb_end() the unicode_wb_info_t word breaking
handle is no longer valid.

Between the call to unicode_wb_init() and the return from
unicode_wb_end(), the callback function gets invoked exactly once for
each Unicode character given to unicode_wb_next() or
unicode_wb_next_cnt(). Usually each call to unicode_wb_next() results
in the callback function getting invoked immediately, but this is not
guaranteed. A call to unicode_wb_next() may return without invoking
the callback function, and some subsequent call to unicode_wb_next()
(or unicode_wb_end()) may then invoke the callback function more than
once, to catch up. The contract is that, unless an error occurs, by the
time unicode_wb_end() returns the callback function has been invoked
exactly as many times as there are characters in the Unicode sequence
defined by the intervening calls to unicode_wb_next() and
unicode_wb_next_cnt().

Each call to the callback function reports the calculated word
breaking status of the corresponding character in the Unicode
character sequence. If the first parameter to the callback function is
non-zero, a word break is permitted before the corresponding character.
A zero value indicates that a word break is prohibited before the
corresponding character.

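The meaning of the reported values can be illustrated with a small
sketch. It assumes the caller has already collected the callback's
first parameter for each character into an array; segment_starts() is
a hypothetical helper, not part of the Courier Unicode Library.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the Courier Unicode Library.
 * flags[i] is the value the callback received for character i:
 * non-zero means a word break is permitted before character i.
 * Stores the starting index of each segment in starts[] (at most
 * max entries) and returns the total number of segments found,
 * which may exceed max. */
static size_t segment_starts(const int *flags, size_t n,
                             size_t *starts, size_t max)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Character 0 always opens a segment; a later character
         * opens one only if a break is permitted before it. */
        if (i == 0 || flags[i] != 0) {
            if (count < max)
                starts[count] = i;
            count++;
        }
    }
    return count;
}
```

Each segment runs from its starting index up to, but not including,
the next starting index (or the end of the sequence).
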
The callback function should return 0. A non-zero value indicates to
the word breaking algorithm that an error has occurred.
unicode_wb_next() and unicode_wb_next_cnt() return zero either if they
never invoked the callback function, or if each call to the callback
function returned zero. A non-zero return from the callback function
results in unicode_wb_next() and unicode_wb_next_cnt() immediately
returning the same value.

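A minimal sketch of a callback that follows this convention is shown
here. The struct wb_record type and the record_break() name are
hypothetical, chosen only for illustration; the library treats the
second argument as an opaque pointer.

```c
#include <stddef.h>

/* Hypothetical context type; the library only sees it as void *. */
struct wb_record {
    int *flags;       /* caller-supplied output array */
    size_t capacity;  /* number of slots in flags */
    size_t count;     /* callback invocations recorded so far */
};

/* Has the int (*)(int, void *) shape that unicode_wb_init() expects
 * for cb_func. Records one flag per character. */
static int record_break(int can_break, void *arg)
{
    struct wb_record *rec = arg;

    if (rec->count >= rec->capacity)
        return -1;  /* non-zero: propagated to the caller as an error */

    rec->flags[rec->count++] = (can_break != 0);
    return 0;       /* zero: continue normally */
}
```

A caller would pass record_break and a pointer to a struct wb_record
to unicode_wb_init(), feed characters with unicode_wb_next() or
unicode_wb_next_cnt(), and read the recorded flags only after
unicode_wb_end() returns, since some callbacks may be deferred until
then.
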
unicode_wb_end() must be invoked to destroy the word breaking handle
even if unicode_wb_next() or unicode_wb_next_cnt() returned an error
indication. It's also possible that, under normal circumstances,
unicode_wb_end() invokes the callback function one or more times. The
return value from unicode_wb_end() has the same meaning as from
unicode_wb_next() and unicode_wb_next_cnt(); however, in all cases the
word breaking handle is no longer valid after unicode_wb_end() returns.

Word scan
unicode_wbscan_init(), unicode_wbscan_next() and unicode_wbscan_end()
scan for the next word boundary in a Unicode character sequence.
unicode_wbscan_init() obtains a handle, then unicode_wbscan_next() gets
repeatedly invoked to define the Unicode character sequence.
unicode_wbscan_end() deallocates the handle and returns the number of
leading characters in the Unicode character sequence up to the first
word break.

A non-zero return value from unicode_wbscan_next() indicates that the
word boundary is already known, and any further calls to
unicode_wbscan_next() will be ignored. unicode_wbscan_end() must still
be called, to obtain the Unicode character count.

SEE ALSO
TR-29[1], courier-unicode(7), unicode::wordbreak(3),
unicode_convert_tocase(3), unicode_line_break(3),
unicode_grapheme_break(3).

AUTHOR
Sam Varshavchik
Author

NOTES
1. TR-29
   http://www.unicode.org/reports/tr29/tr29-27.html

Courier Unicode Library          03/11/2017          UNICODE_WORD_BREAK(3)