Unicode::GCString(3)  User Contributed Perl Documentation  Unicode::GCString(3)

NAME
       Unicode::GCString - String as Sequence of UAX #29 Grapheme Clusters

SYNOPSIS
           use Unicode::GCString;
           $gcstring = Unicode::GCString->new($string);

DESCRIPTION
       Unicode::GCString treats a Unicode string as a sequence of extended
       grapheme clusters as defined by Unicode Standard Annex #29 [UAX #29].

       A grapheme cluster is a sequence of one or more Unicode characters
       consisting of one grapheme base and optional grapheme extenders and/or
       “prepend” characters.  It is close to what people consider a character.

   Public Interface
       Constructors

       new (STRING, [KEY => VALUE, ...])
       new (STRING, [LINEBREAK])
           Constructor.  Creates a new grapheme cluster string
           (Unicode::GCString object) from the Unicode string STRING.

           For the optional KEY => VALUE pairs, see "Options" in
           Unicode::LineBreak.  In the second form, the Unicode::LineBreak
           object LINEBREAK controls breaking features.

           Note: The first form was introduced in release 2012.10.

       copy
           Copy constructor.  Creates a copy of the grapheme cluster string.
           The next position of the new string is set to the beginning.

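       As a rough illustration of the constructors, the sketch below builds
       an object from a short sample string and then copies it.  The sample
       string and the variable names are assumptions made for this example
       only; pos() is described further down in this page.

```perl
use strict;
use warnings;
use Unicode::GCString;

# Sample input (an assumption for illustration): "e" followed by
# U+0301 COMBINING ACUTE ACCENT, then "x".
my $gcstring = Unicode::GCString->new("e\x{301}x");

# Move the next position of the original forward by one cluster.
$gcstring->pos(1);

# The copy holds the same text.  Per the description of copy(), its next
# position is expected to start at the beginning.
my $copy = $gcstring->copy;

printf "original pos=%d, copy pos=%d\n", $gcstring->pos, $copy->pos;
```
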
       Sizes

       chars
           Instance method.  Returns the number of Unicode characters the
           grapheme cluster string contains, i.e. its length as a Unicode
           string.

       columns
           Instance method.  Returns the total number of columns of the
           grapheme clusters, as defined by the built-in character database.
           For more details see "DESCRIPTION" in Unicode::LineBreak.

       length
           Instance method.  Returns the number of grapheme clusters contained
           in the grapheme cluster string.

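       The difference between the three size methods is easiest to see on a
       string that contains a combining character.  The following is a
       minimal sketch; the sample string is an assumption chosen for the
       example.

```perl
use strict;
use warnings;
use Unicode::GCString;

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x".
my $gcstring = Unicode::GCString->new("e\x{301}x");

my $chars   = $gcstring->chars;     # Unicode characters: 3
my $length  = $gcstring->length;    # grapheme clusters: 2
my $columns = $gcstring->columns;   # columns per the built-in database

print "chars=$chars length=$length columns=$columns\n";
```
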
       Operations as String

       as_string
       "OBJECT"
           Instance method.  Converts the grapheme cluster string to a Unicode
           string explicitly.

       cmp (STRING)
       STRING cmp STRING
           Instance method.  Compares strings.  There are no oddities.  Either
           STRING may be a Unicode string.

       concat (STRING)
       STRING . STRING
           Instance method.  Concatenates STRINGs.  Either STRING may be a
           Unicode string.  Note that the number of columns (see columns()) or
           grapheme clusters (see length()) of the resulting string is not
           always equal to the sum of those of both strings.  The next
           position of the new string is that of the left-hand value.

       join ([STRING, ...])
           Instance method.  Joins STRINGs, inserting the grapheme cluster
           string between them.  Any of the STRINGs may be a Unicode string.

       substr (OFFSET, [LENGTH, [REPLACEMENT]])
           Instance method.  Returns a substring of the grapheme cluster
           string.  OFFSET and LENGTH are counted in grapheme clusters.  If
           REPLACEMENT is specified, the substring is replaced by it.
           REPLACEMENT may be a Unicode string.

           Note: Unlike the built-in substr(), this method cannot return an
           lvalue.

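       The string operations can be sketched as follows.  The sample strings
       and variable names are assumptions for illustration.  The first part
       shows why the cluster count of a concatenation need not be the sum of
       its parts: a lone combining accent counts as a cluster by itself, but
       attaches to the preceding base once the two are concatenated.

```perl
use strict;
use warnings;
use Unicode::GCString;

# Concatenation can merge clusters across the junction.
my $base   = Unicode::GCString->new("e");
my $accent = Unicode::GCString->new("\x{301}");   # lone combining accent
my $joined = $base->concat($accent);
printf "%d + %d cluster(s) -> %d cluster(s)\n",
    $base->length, $accent->length, $joined->length;

# substr() counts in grapheme clusters, not in characters.
my $word   = Unicode::GCString->new("ae\x{301}z");
my $middle = $word->substr(1, 1);    # second cluster: "e" plus accent

# join() inserts the invocant between the given strings.
my $list = Unicode::GCString->new(", ")->join("a", "b", "c");

# cmp() accepts a plain Unicode string as its argument.
my $same = $word->cmp("ae\x{301}z");    # 0 when the strings are equal
```
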
       Operations as Sequence of Grapheme Clusters

       as_array
       @{OBJECT}
       as_arrayref
           Instance method.  Converts the grapheme cluster string to an array
           of grapheme clusters.

       eos
           Instance method.  Tests whether the current position is at the end
           of the grapheme cluster string.

       item ([OFFSET])
           Instance method.  Returns the OFFSET-th grapheme cluster.  If
           OFFSET is not specified, returns the next grapheme cluster.

       next
       <OBJECT>
           Instance method, iterative.  Returns the next grapheme cluster and
           increments the next position.

       pos ([OFFSET])
           Instance method.  If the optional OFFSET is specified, sets the
           next position to it.  Returns the next position of the grapheme
           cluster string.

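       A minimal sketch of walking a string cluster by cluster with eos(),
       next() and pos(), and of random access with as_array() and item().
       The sample string is an assumption; the example also assumes that
       offsets passed to item() start at 0.

```perl
use strict;
use warnings;
use Unicode::GCString;

my $gcstring = Unicode::GCString->new("ae\x{301}z");

# Walk the string one grapheme cluster at a time.
my @seen;
until ($gcstring->eos) {
    my $cluster = $gcstring->next;    # also advances the next position
    push @seen, "$cluster";           # stringify the cluster
}

# Rewind, then fetch clusters without iterating.
$gcstring->pos(0);
my @clusters = $gcstring->as_array;   # one element per cluster
my $second   = $gcstring->item(1);    # offset 1 (0-based, assumed)

print scalar(@seen), " clusters seen\n";
```
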
       Miscellaneous

       lbc
           Instance method.  Returns the Line Breaking Class (see
           Unicode::LineBreak) of the first character of the first grapheme
           cluster.

       lbcext
           Instance method.  Returns the Line Breaking Class (see
           Unicode::LineBreak) of the last grapheme extender of the last
           grapheme cluster.  If there are no grapheme extenders, or if its
           class is CM, the value of the last grapheme base is returned.

CAVEATS
       •   The grapheme cluster should not be referred to as "grapheme" even
           though Larry does.

       •   On Perl around 5.10.1, implicit conversion from a Unicode::GCString
           object to a Unicode string sometimes confuses the
           "utf8_mg_pos_cache_update" cache.

           For example, instead of doing

               $sub = substr($gcstring, $i, $j);

           do

               $sub = substr("$gcstring", $i, $j);

           or

               $sub = substr($gcstring->as_string, $i, $j);

       •   This module implements the default algorithm for determining
           grapheme cluster boundaries.  A tailoring mechanism is not
           supported yet.

VERSION
       Consult the $VERSION variable.

   Incompatible Changes
       Release 2013.10
           •   The new() method can take a non-Unicode string argument.  In
               this case the string is decoded as iso-8859-1 (Latin 1).  In
               earlier releases this method would die on non-Unicode input.

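       The change in release 2013.10 can be illustrated with a one-byte
       string.  This is a sketch under the assumption that the installed
       release is 2013.10 or later; the variable names are chosen for the
       example.

```perl
use strict;
use warnings;
use Unicode::GCString;

# A one-byte, non-Unicode string: 0xE9 is "e with acute" in Latin 1.
my $bytes = "\xE9";

# With release 2013.10 or later this is decoded as iso-8859-1 rather than
# dying, so the object holds the single character U+00E9.
my $gcstring = Unicode::GCString->new($bytes);

printf "chars=%d length=%d\n", $gcstring->chars, $gcstring->length;
```
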
SEE ALSO
       [UAX #29] Mark Davis (ed.) (2009-2013).  Unicode Standard Annex #29:
       Unicode Text Segmentation, Revisions 15-23.
       <http://www.unicode.org/reports/tr29/>.

AUTHOR
       Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>

COPYRIGHT
       Copyright (C) 2009-2013 Hatuka*nezumi - IKEDA Soji.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.36.0                      2022-07-22              Unicode::GCString(3)