String(3) User Contributed Perl Documentation String(3)

NAME
Unicode::String - String of Unicode characters (UTF-16BE)

SYNOPSIS
    use Unicode::String qw(utf8 latin1 utf16be);

    $u = utf8("string");
    $u = latin1("string");
    $u = utf16be("\0s\0t\0r\0i\0n\0g");

    print $u->utf32be;   # 4 bytes per character
    print $u->utf16le;   # 2 bytes per character, plus surrogates
    print $u->utf8;      # 1-4 bytes per character

DESCRIPTION
A "Unicode::String" object represents a sequence of Unicode characters.
Methods are provided to convert between various external formats
(encodings) and "Unicode::String" objects, and for common string
manipulations.

The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(),
utf7(), latin1(), uhex() and uchr() can be imported from the
"Unicode::String" module and work as constructors, initializing
strings from data in the corresponding encoding.

"Unicode::String" objects overload various operators, which means that
in most cases they can be treated like plain strings.

Internally, a "Unicode::String" object is represented by a string of
2-byte numbers in network byte order (big-endian). This representation
is not exposed by the API, but it is useful to know when predicting the
efficiency of the provided methods.
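
The overloaded operators and the 2-byte representation described above can be sketched as follows. This is only an illustration: the variable names are made up, and it assumes the latin1() constructor has been imported.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

# "µm" given as ISO-8859-1 bytes
my $u = latin1("\xB5m");

# Overloaded operators return new Unicode::String objects
my $twice  = $u x 2;             # same as $u->repeat(2)
my $joined = $u . latin1("!");   # same as $u->concat(...)

# Each character occupies 2 bytes when written out as UTF-16BE
my $bytes = $u->utf16be;

printf "%d characters, %d bytes as UTF-16BE\n", $u->length, length $bytes;
```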

METHODS

Class methods

The following class methods are available:

Unicode::String->stringify_as
Unicode::String->stringify_as( $enc )
This method specifies which encoding is used when "Unicode::String"
objects are implicitly converted to and from plain strings.

If an argument is provided, it sets the current encoding. The argument
should be one of the following: "ucs4", "utf32", "utf32be", "utf32le",
"ucs2", "utf16", "utf16be", "utf16le", "utf8", "utf7", "latin1" or
"hex". The default is "utf8".

The stringify_as() method returns a reference to the current encoding
function.

$us = Unicode::String->new
$us = Unicode::String->new( $initial_value )
This is the object constructor. Without an argument, it creates an
empty "Unicode::String" object. If an $initial_value argument is
given, it is decoded according to the current stringify_as() encoding,
which is UTF-8 by default.

In general, it is recommended to import and use one of the
encoding-specific constructor functions instead of invoking this
method.
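
A minimal sketch of how stringify_as() interacts with new() and with implicit stringification. It assumes the setting applies to the whole program, which is why the default is restored at the end; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::String;

# Interpret and produce plain strings as ISO-8859-1 from now on
Unicode::String->stringify_as('latin1');

my $u     = Unicode::String->new("\xB5m");   # decoded as Latin-1
my $plain = "$u";                            # encoded as Latin-1

# Switch to the "hex" form, which is handy for debugging output
Unicode::String->stringify_as('hex');
my $hex = "$u";

# Assumed to be a program-wide setting, so restore the default when done
Unicode::String->stringify_as('utf8');

print "$hex\n";
```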

Encoding methods

These methods get or set the value of the "Unicode::String" object by
passing strings in the corresponding encoding. If a new value is passed
as an argument, it sets the value of the "Unicode::String" and the
previous value is returned. If no argument is passed, the current value
is returned.

To illustrate the encodings, we show how the 2-character sample string
"µm" (micrometer) is encoded in each one.

$us->utf32be
$us->utf32be( $newval )
The string passed should be in the UTF-32 encoding with bytes in
big-endian order. The sample "µm" is "\0\0\0\xB5\0\0\0m" in this
encoding.

Alternative names for this method are utf32() and ucs4().

$us->utf32le
$us->utf32le( $newval )
The string passed should be in the UTF-32 encoding with bytes in
little-endian order. The sample "µm" is "\xB5\0\0\0m\0\0\0" in this
encoding.

$us->utf16be
$us->utf16be( $newval )
The string passed should be in the UTF-16 encoding with bytes in
big-endian order. The sample "µm" is "\0\xB5\0m" in this encoding.

Alternative names for this method are utf16() and ucs2().

If the string passed to utf16be() starts with the Unicode byte order
mark in little-endian order, the result is as if utf16le() was called
instead.

$us->utf16le
$us->utf16le( $newval )
The string passed should be in the UTF-16 encoding with bytes in
little-endian order. The sample "µm" is "\xB5\0m\0" in this encoding.
This is the encoding used by the Microsoft Windows API.

If the string passed to utf16le() starts with the Unicode byte order
mark in big-endian order, the result is as if utf16be() was called
instead.
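
The byte order mark handling described for utf16be() might be exercised like this. The sketch relies only on the behaviour stated above. Whether the mark itself stays in the string as a leading U+FEFF is not stated here, so the code prints the code points rather than assuming either outcome.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# "µm" as UTF-16LE, preceded by a little-endian byte order mark (FF FE)
my $le_with_bom = "\xFF\xFE" . "\xB5\0m\0";

# utf16be() is documented to notice the swapped mark and to treat the
# data as little endian
my $u = utf16be($le_with_bom);

# Show the resulting 16-bit values as code points
my @codes = $u->unpack;
print join(" ", map { sprintf "U+%04X", $_ } @codes), "\n";
```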

$us->utf8
$us->utf8( $newval )
The string passed should be in the UTF-8 encoding. The sample "µm" is
"\xC2\xB5m" in this encoding.

$us->utf7
$us->utf7( $newval )
The string passed should be in the UTF-7 encoding. The sample "µm" is
"+ALU-m" in this encoding.

The UTF-7 encoding uses only plain US-ASCII characters. This makes it
safe for transport through protocols that strip the eighth bit.
Characters outside the US-ASCII range are base64-encoded, and '+' is
used as an escape character. The UTF-7 encoding is described in
RFC 1642 (since obsoleted by RFC 2152).

If the (global) variable $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS
is TRUE, then a wider range of characters is encoded as themselves. It
is TRUE by default. The characters affected by this are:

    ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }

$us->latin1
$us->latin1( $newval )
The string passed should be in the ISO-8859-1 encoding. The sample "µm"
is "\xB5m" in this encoding.

Characters outside the "\x00" .. "\xFF" range are simply removed from
the return value of the latin1() method. If you want more control over
the mapping from Unicode to ISO-8859-1, use the "Unicode::Map8" class.
This is also the way to deal with other 8-bit character sets.
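
As a small illustration of the lossy conversion just described, the sketch below builds a string containing U+20AC (EURO SIGN), which has no ISO-8859-1 equivalent, and converts it with latin1(). The choice of characters is arbitrary.

```perl
use strict;
use warnings;
use Unicode::String;

# Build the string from code points: U+20AC followed by the digit "5"
my $u = Unicode::String->new;
$u->pack(0x20AC, 0x35);

# The euro sign is outside \x00 .. \xFF, so it is dropped from the result
my $latin1 = $u->latin1;
print "Latin-1 result: $latin1\n";
```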

$us->hex
$us->hex( $newval )
The string passed should be plain ASCII where each Unicode character is
represented by a "U+XXXX" string, with the characters separated by
single spaces. The "U+" prefix is optional when setting the value. The
sample "µm" is "U+00b5 U+006d" in this encoding.
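
The encoding methods above can be compared side by side with a short script such as the following. It is a sketch only: it dumps the sample "µm" in each encoding and then shows the set-and-return-previous behaviour. The loop and variable names are not part of the module.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $u = latin1("\xB5m");   # the sample "µm"

# Print each binary encoding as space-separated hex bytes
for my $enc (qw(utf32be utf32le utf16be utf16le utf8 latin1)) {
    my $bytes = $u->$enc;
    printf "%-8s %s\n", $enc,
        join(" ", map { sprintf "%02X", ord($_) } split //, $bytes);
}

# UTF-7 and hex are plain ASCII and can be printed directly
printf "%-8s %s\n", "utf7", $u->utf7;
printf "%-8s %s\n", "hex",  $u->hex;

# Passing an argument sets a new value and returns the previous one
my $previous = $u->latin1("abc");
```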

String Operations

The following methods are available:

$us->as_string
Converts a "Unicode::String" to a plain string according to the setting
of stringify_as(). The default stringify_as() encoding is "utf8".

$us->as_num
Converts a "Unicode::String" to a number. Currently only the digits in
the range 0x30 .. 0x39 are recognized. The plan is to eventually
support all Unicode digit characters.

$us->as_bool
Converts a "Unicode::String" to a boolean value. Only the empty string
is FALSE. A string consisting of only the character U+0030 is
considered TRUE, even though Perl considers "0" to be FALSE.
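
The difference between as_bool() and Perl's own truth rules is easy to miss, so here is a small sketch. It calls the conversion methods explicitly rather than relying on any particular operator overloading.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $zero  = latin1("0");
my $empty = latin1("");
my $num   = latin1("42");

# Only the empty string is false, so "0" counts as true here
my $zero_is_true  = $zero->as_bool  ? 1 : 0;
my $empty_is_true = $empty->as_bool ? 1 : 0;

# ASCII digits are converted to an ordinary Perl number
my $sum = $num->as_num + 1;

print "zero: $zero_is_true, empty: $empty_is_true, sum: $sum\n";
```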

$us->repeat( $count )
Returns a new "Unicode::String" where the content of $us is repeated
$count times. This operation is also overloaded as:

    $us x $count

$us->concat( $other_string )
Concatenates the string $us and the string $other_string. If
$other_string is not a "Unicode::String" object, it is first passed to
the Unicode::String->new constructor. This operation is also overloaded
as:

    $us . $other_string

$us->append( $other_string )
Appends the string $other_string to the value of $us. If $other_string
is not a "Unicode::String" object, it is first passed to the
Unicode::String->new constructor. This operation is also overloaded as:

    $us .= $other_string

$us->copy
Returns a copy of the current "Unicode::String" object. This operation
is overloaded as the assignment operator.

$us->length
Returns the length of the "Unicode::String". Surrogate pairs are
counted as 2.

$us->byteswap
This method swaps the bytes in the internal representation of the
"Unicode::String" object.

Unicode reserves the character U+FEFF as a byte order mark. This works
because the byte-swapped value, U+FFFE, is guaranteed never to be a
valid character. For strings that have the byte order mark as the first
character, we can guarantee the correct byte order with the following
code:

    $ustr->byteswap if $ustr->ord == 0xFFFE;
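
To see the byte order check in action, the sketch below simulates little-endian data that was stored as if it were big endian. It uses pack() to place the swapped values directly, which is an assumption made for the sake of a self-contained example.

```perl
use strict;
use warnings;
use Unicode::String;

# U+FEFF, U+00B5 and U+006D, each with its two bytes swapped
my $u = Unicode::String->new;
$u->pack(0xFFFE, 0xB500, 0x6D00);

# The swapped byte order mark reveals the problem; fix it in place
$u->byteswap if $u->ord == 0xFFFE;

my $codes = join " ", map { sprintf "U+%04X", $_ } $u->unpack;
print "$codes\n";
```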

$us->unpack
Returns a list of integers, each representing a UCS-2 character code.

$us->pack( @uchr )
Sets the value of $us to a sequence of UCS-2 characters with the
character codes given as arguments.

$us->ord
Returns the character code of the first character in $us. The ord()
method handles surrogate pairs, which gives a result range of
0x0 .. 0x10FFFF. If the $us string is empty, undef is returned.

$us->chr( $code )
Sets the value of $us to a string containing the character with code
$code. The argument $code must be an integer in the range
0x0 .. 0x10FFFF. If the code is greater than 0xFFFF, a surrogate pair
is created.
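
The interplay of length(), ord() and unpack() for characters above 0xFFFF can be sketched as follows. The example assumes the uchr() constructor function has been imported and uses U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary supplementary character.

```perl
use strict;
use warnings;
use Unicode::String qw(uchr);

# A code point above 0xFFFF is stored as a surrogate pair
my $clef = uchr(0x1D11E);

my $units = $clef->length;   # counts both halves of the pair
my $code  = $clef->ord;      # recombines the pair into one code point
my @raw   = $clef->unpack;   # the individual 16-bit values

printf "length=%d ord=U+%X units=%s\n",
    $units, $code, join(" ", map { sprintf "%04X", $_ } @raw);
```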

$us->name
In scalar context, returns the official Unicode name of the first
character in $us. In list context, returns the names of all characters
in $us. See also Unicode::CharName.

$us->substr( $offset )
$us->substr( $offset, $length )
$us->substr( $offset, $length, $subst )
Returns a substring of $us. Works like the builtin substr() function.

$us->index( $other )
$us->index( $other, $pos )
Locates the position of $other within $us, optionally starting the
search at position $pos.

$us->chop
Chops off the last character of $us and returns it (as a
"Unicode::String" object).
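
A brief sketch of index(), substr() and chop() working together. Positions and lengths are taken to be counted in characters, as with the builtin functions; the sample text is arbitrary.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $u = latin1("hello world");

my $pos  = $u->index(latin1("world"));   # offset of the match in $u
my $word = $u->substr($pos, 5);          # a new Unicode::String
my $last = $u->chop;                     # removes and returns the last character

printf "pos=%d word=%s last=%s rest=%s\n",
    $pos, $word->latin1, $last->latin1, $u->latin1;
```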

FUNCTIONS
The following functions are provided. None of these are exported by
default.

byteswap2( $str, ... )
This function swaps the two bytes of each 2-byte unit in the strings
passed as arguments. If the function is called in void context, it
modifies its arguments in place. Otherwise, the swapped strings are
returned.

byteswap4( $str, ... )
The byteswap4 function works like byteswap2, but reverses the byte
order of each 4-byte unit.
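
The two calling styles of the byteswap functions might look like this in practice. The sketch calls the functions in list context when it wants copies, which avoids relying on how a multi-value return behaves in scalar context.

```perl
use strict;
use warnings;
use Unicode::String qw(byteswap2 byteswap4);

# Non-void context: the arguments are left alone and copies are returned
my ($le16) = byteswap2("\0\xB5\0m");            # UTF-16BE to UTF-16LE
my ($le32) = byteswap4("\0\0\0\xB5\0\0\0m");    # UTF-32BE to UTF-32LE

# Void context: the argument itself is modified
my $buf = "\0\xB5\0m";
byteswap2($buf);
```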

latin1( $str )
utf7( $str )
utf8( $str )
utf16le( $str )
utf16be( $str )
utf32le( $str )
utf32be( $str )
Constructor functions for the various Unicode encodings. These return
new "Unicode::String" objects. The argument should be encoded
correspondingly.

uhex( $str )
Constructs a new "Unicode::String" object from a string of hex values.
See the hex() method above for a description of the format.

uchr( $num )
Constructs a new one-character "Unicode::String" object from a Unicode
character code. This works like Perl's builtin chr() function.

SEE ALSO
Unicode::CharName, Unicode::Map8

<http://www.unicode.org/>

perlunicode

COPYRIGHT
Copyright 1997-2000,2005 Gisle Aas.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.8.8 2005-10-26 String(3)