Unicode::GCString(3)  User Contributed Perl Documentation Unicode::GCString(3)

NAME

       Unicode::GCString - String as Sequence of UAX #29 Grapheme Clusters

SYNOPSIS

           use Unicode::GCString;
           $gcstring = Unicode::GCString->new($string);

DESCRIPTION

       Unicode::GCString treats a Unicode string as a sequence of extended
       grapheme clusters as defined by Unicode Standard Annex #29 [UAX #29].

       A grapheme cluster is a sequence of one or more Unicode characters
       consisting of one grapheme base and optional grapheme extender and/or
       “prepend” characters.  It is close to what people consider a
       character.

   Public Interface
       Constructors

       new (STRING, [KEY => VALUE, ...])
       new (STRING, [LINEBREAK])
           Constructor.  Creates a new grapheme cluster string
           (Unicode::GCString object) from the Unicode string STRING.

           For the optional KEY => VALUE pairs, see "Options" in
           Unicode::LineBreak.  In the second form, the Unicode::LineBreak
           object LINEBREAK controls breaking features.

           Note: The first form was introduced in release 2012.10.

       copy
           Copy constructor.  Creates a copy of the grapheme cluster string.
           The next position of the new string is set to the beginning.

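       The two forms of new() and the copy constructor might be used as in
       the following sketch.  The sample string and the choice of the Context
       option (one of the options documented in Unicode::LineBreak) are
       assumptions made only for illustration.

```perl
use strict;
use warnings;
use Unicode::LineBreak;
use Unicode::GCString;

# Sample input: "e" followed by COMBINING ACUTE ACCENT, then "bc".
my $string = "e\x{301}bc";

# First form: options are passed as KEY => VALUE pairs.
my $gc1 = Unicode::GCString->new($string, Context => 'EASTASIAN');

# Second form: an existing Unicode::LineBreak object supplies the options.
my $lb  = Unicode::LineBreak->new(Context => 'EASTASIAN');
my $gc2 = Unicode::GCString->new($string, $lb);

# Move the next position of the original, then copy it.
$gc1->pos(2);
my $copy = $gc1->copy;

printf "original pos=%d, copy pos=%d\n", $gc1->pos, $copy->pos;
```
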
       Sizes

       chars
           Instance method.  Returns the number of Unicode characters the
           grapheme cluster string contains, i.e. its length as a Unicode
           string.

       columns
           Instance method.  Returns the total number of columns of the
           grapheme clusters, as defined by the built-in character database.
           For more details see "DESCRIPTION" in Unicode::LineBreak.

       length
           Instance method.  Returns the number of grapheme clusters
           contained in the grapheme cluster string.

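       The difference between the three size methods shows up as soon as a
       string contains a combining character.  A minimal sketch, using a
       sample string chosen for illustration:

```perl
use strict;
use warnings;
use Unicode::GCString;

# "e" + COMBINING ACUTE ACCENT forms a single grapheme cluster.
my $gc = Unicode::GCString->new("e\x{301}bc");

my $chars   = $gc->chars;     # Unicode characters: 4
my $length  = $gc->length;    # grapheme clusters: 3
my $columns = $gc->columns;   # display columns per the built-in database

print "chars=$chars length=$length columns=$columns\n";
```
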
       Operations as String

       as_string
       "OBJECT"
           Instance method.  Converts the grapheme cluster string to a
           Unicode string explicitly.

       cmp (STRING)
       STRING cmp STRING
           Instance method.  Compares strings.  There are no oddities.
           Either STRING may be a Unicode string.

       concat (STRING)
       STRING . STRING
           Instance method.  Concatenates STRINGs.  Either STRING may be a
           Unicode string.  Note that the number of columns (see columns())
           or grapheme clusters (see length()) of the resulting string is
           not always equal to the sum of those of both strings.  The next
           position of the new string is that of the left operand.

       join ([STRING, ...])
           Instance method.  Joins STRINGs, inserting the grapheme cluster
           string between them.  Any of the STRINGs may be a Unicode string.

       substr (OFFSET, [LENGTH, [REPLACEMENT]])
           Instance method.  Returns a substring of the grapheme cluster
           string.  OFFSET and LENGTH are counted in grapheme clusters.  If
           REPLACEMENT is specified, the substring is replaced by it.
           REPLACEMENT may be a Unicode string.

           Note: Unlike the built-in substr(), this method cannot return an
           lvalue.

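       The string operations can be sketched as follows.  The sample strings
       are assumptions made for illustration; the concat() part shows a case
       where the cluster count of the result is smaller than the sum of the
       operands.

```perl
use strict;
use warnings;
use Unicode::GCString;

my $gc = Unicode::GCString->new("e\x{301}bc");

# Comparison accepts a plain Unicode string as the other operand.
my $same = (($gc cmp "e\x{301}bc") == 0) ? 1 : 0;

# "e" and a lone combining accent are one cluster each, but their
# concatenation is re-segmented into a single cluster.
my $base   = Unicode::GCString->new("e");
my $accent = Unicode::GCString->new("\x{301}");
my $joined = $base->concat($accent);
my $sum    = $base->length + $accent->length;

# join() uses the object itself as the separator.
my $list = Unicode::GCString->new(", ")->join("a", "b", "c");

# substr() counts grapheme clusters: offset 0, length 1 is "e" + accent.
my $first = $gc->substr(0, 1);

printf "same=%d joined=%d sum=%d\n", $same, $joined->length, $sum;
print  "list=$list\n";
printf "first cluster has %d character(s)\n", $first->chars;
```
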
       Operations as Sequence of Grapheme Clusters

       as_array
       @{OBJECT}
       as_arrayref
           Instance method.  Converts the grapheme cluster string to an
           array of grapheme clusters.

       eos Instance method.  Tests whether the current position is at the end
           of the grapheme cluster string.

       item ([OFFSET])
           Instance method.  Returns the OFFSET-th grapheme cluster.  If
           OFFSET is not specified, returns the next grapheme cluster.

       next
       <OBJECT>
           Instance method, iterative.  Returns the next grapheme cluster
           and increments the next position.

       pos ([OFFSET])
           Instance method.  If the optional OFFSET is specified, sets the
           next position to it.  Returns the next position of the grapheme
           cluster string.

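       Iterating over clusters with the methods above might look like the
       following sketch.  The sample string is an assumption made for
       illustration, and offsets are taken to start at 0, as with substr().

```perl
use strict;
use warnings;
use Unicode::GCString;

my $gc = Unicode::GCString->new("e\x{301}bc");

# as_array() returns one element per grapheme cluster.
my @clusters = $gc->as_array;
my $count    = scalar @clusters;

# Walk the string with eos()/next(); next() advances the next position.
my @seen;
$gc->pos(0);
until ($gc->eos) {
    my $cluster = $gc->next;
    push @seen, $cluster->chars;    # characters in this cluster
}
my $end = $gc->pos;

# item() with an explicit offset fetches a cluster directly.
my $second = $gc->item(1);

print "clusters=$count chars-per-cluster=@seen end=$end second=$second\n";
```
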
       Miscellaneous

       lbc Instance method.  Returns the Line Breaking Class (see
           Unicode::LineBreak) of the first character of the first grapheme
           cluster.

       lbcext
           Instance method.  Returns the Line Breaking Class (see
           Unicode::LineBreak) of the last grapheme extender of the last
           grapheme cluster.  If there are no grapheme extenders or its
           class is CM, the value of the last grapheme base will be returned.

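       A small sketch of lbc() and lbcext().  It only compares the returned
       values with each other and makes no assumption about how particular
       classes are numbered; the sample strings are chosen for illustration.

```perl
use strict;
use warnings;
use Unicode::GCString;

# lbc() looks at the first character of the first cluster.
my $a_class     = Unicode::GCString->new("abc")->lbc;
my $b_class     = Unicode::GCString->new("b")->lbc;
my $space_class = Unicode::GCString->new(" x")->lbc;

# The only extender here is a combining mark (class CM), so lbcext()
# falls back to the class of the grapheme base "e".
my $accented = Unicode::GCString->new("e\x{301}");
my $ext      = $accented->lbcext;

printf "a=%d b=%d space=%d ext=%d\n",
    $a_class, $b_class, $space_class, $ext;
```
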

CAVEATS

       •   The grapheme cluster should not be referred to as "grapheme" even
           though Larry does.

       •   On Perl around 5.10.1, implicit conversion from a
           Unicode::GCString object to a Unicode string sometimes confuses
           the "utf8_mg_pos_cache_update" cache.

           For example, instead of doing

               $sub = substr($gcstring, $i, $j);

           do

               $sub = substr("$gcstring", $i, $j);

           or

               $sub = substr($gcstring->as_string, $i, $j);

       •   This module implements the default algorithm for determining
           grapheme cluster boundaries.  A tailoring mechanism is not
           supported yet.

VERSION

       Consult the $VERSION variable.

   Incompatible Changes
       Release 2013.10
           •   The new() method can take a non-Unicode string argument.  In
               this case the string will be decoded as iso-8859-1 (Latin 1).
               In former releases, the method would die on non-Unicode
               input.

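       The change can be illustrated with a byte string.  This is a sketch
       assuming release 2013.10 or later; the single sample byte is chosen
       for illustration.

```perl
use strict;
use warnings;
use Unicode::GCString;

# A byte string (no UTF-8 flag) holding the single byte 0xE9.
my $bytes = "\xE9";
utf8::downgrade($bytes);

# The byte is decoded as iso-8859-1, i.e. as U+00E9.
my $gc = Unicode::GCString->new($bytes);

printf "chars=%d first=U+%04X\n", $gc->chars, ord($gc->as_string);
```
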

SEE ALSO

       [UAX #29] Mark Davis (ed.) (2009-2013).  Unicode Standard Annex #29:
       Unicode Text Segmentation, Revisions 15-23.
       <http://www.unicode.org/reports/tr29/>.

AUTHOR

       Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>

COPYRIGHT

       Copyright (C) 2009-2013 Hatuka*nezumi - IKEDA Soji.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.36.0                      2023-01-20              Unicode::GCString(3)