String(3) User Contributed Perl Documentation String(3)

NAME
Unicode::String - String of Unicode characters (UTF-16BE)

SYNOPSIS
    use Unicode::String qw(utf8 latin1 utf16be);

    $u = utf8("string");
    $u = latin1("string");
    $u = utf16be("\0s\0t\0r\0i\0n\0g");

    print $u->utf32be;   # 4 byte characters
    print $u->utf16le;   # 2 byte characters + surrogates
    print $u->utf8;      # 1-4 byte characters

DESCRIPTION
A "Unicode::String" object represents a sequence of Unicode characters.
Methods are provided to convert between various external formats
(encodings) and "Unicode::String" objects, and methods are provided for
common string manipulations.

The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(),
utf7(), latin1(), uhex(), uchr() can be imported from the
"Unicode::String" module and will work as constructors initializing
strings of the corresponding encoding.

The "Unicode::String" objects overload various operators, which means
that in most cases they can be treated like plain strings.

Internally a "Unicode::String" object is represented by a string of 2
byte numbers in network byte order (big-endian). This representation is
not visible through the API, but it might be useful to know in
order to predict the efficiency of the provided methods.

METHODS
Class methods
The following class methods are available:

Unicode::String->stringify_as
Unicode::String->stringify_as( $enc )
    This method is used to specify which encoding will be used when
    "Unicode::String" objects are implicitly converted to and from
    plain strings.

    If an argument is provided, it sets the current encoding. The
    argument should be one of the following: "ucs4", "utf32",
    "utf32be", "utf32le", "ucs2", "utf16", "utf16be", "utf16le",
    "utf8", "utf7", "latin1" or "hex". The default is "utf8".

    The stringify_as() method returns a reference to the current
    encoding function.
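As a minimal sketch of the implicit conversion described above, the following script stringifies the same object before and after changing the encoding. It assumes the setting applies to all "Unicode::String" objects (it is a class method), so the default is restored afterwards. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $u = latin1("\xB5m");            # the "µm" sample as ISO-8859-1 bytes

# With the default setting, string interpolation yields UTF-8.
my $as_utf8 = "$u";                 # "\xC2\xB5m"

# Switch implicit conversion to Latin-1, stringify again, then restore
# the default so that unrelated code is not affected.
Unicode::String->stringify_as('latin1');
my $as_latin1 = "$u";               # "\xB5m"
Unicode::String->stringify_as('utf8');

print length($as_utf8), " bytes vs. ", length($as_latin1), " bytes\n";
```

Calling the explicit encoding methods (for example $u->latin1) avoids relying on this shared setting.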

$us = Unicode::String->new
$us = Unicode::String->new( $initial_value )
    This is the object constructor. Without argument, it creates an
    empty "Unicode::String" object. If an $initial_value argument is
    given, it is decoded according to the specified stringify_as()
    encoding, UTF-8 by default.

    In general it is recommended to import and use one of the encoding
    specific constructor functions instead of invoking this method.
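The difference between the two styles can be sketched as follows, assuming stringify_as() is still at its "utf8" default. new() depends on that setting, while the imported utf8() constructor always decodes UTF-8.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# Both calls receive the UTF-8 bytes for "µm".
my $via_new  = Unicode::String->new("\xC2\xB5m");   # decoded per stringify_as()
my $via_utf8 = utf8("\xC2\xB5m");                   # always decoded as UTF-8

print $via_new->hex,  "\n";   # U+00b5 U+006d
print $via_utf8->hex, "\n";   # U+00b5 U+006d
```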

Encoding methods
These methods get or set the value of the "Unicode::String" object by
passing strings in the corresponding encoding. If a new value is
passed as an argument, it sets the value of the "Unicode::String" and
the previous value is returned. If no argument is passed, the
current value is returned.

To illustrate the encodings, we show how the 2 character sample string
"µm" (micrometer) is encoded in each one.

$us->utf32be
$us->utf32be( $newval )
    The string passed should be in the UTF-32 encoding with bytes in
    big endian order. The sample "µm" is "\0\0\0\xB5\0\0\0m" in this
    encoding.

    Alternative names for this method are utf32() and ucs4().

$us->utf32le
$us->utf32le( $newval )
    The string passed should be in the UTF-32 encoding with bytes in
    little endian order. The sample "µm" is "\xB5\0\0\0m\0\0\0" in
    this encoding.

$us->utf16be
$us->utf16be( $newval )
    The string passed should be in the UTF-16 encoding with bytes in
    big endian order. The sample "µm" is "\0\xB5\0m" in this encoding.

    Alternative names for this method are utf16() and ucs2().

    If the string passed to utf16be() starts with the Unicode byte
    order mark in little endian order, the result is as if utf16le()
    was called instead.

$us->utf16le
$us->utf16le( $newval )
    The string passed should be in the UTF-16 encoding with bytes in
    little endian order. The sample "µm" is "\xB5\0m\0" in this
    encoding. This is the encoding used by the Microsoft Windows API.

    If the string passed to utf16le() starts with the Unicode byte
    order mark in big endian order, the result is as if utf16be() was
    called instead.

$us->utf8
$us->utf8( $newval )
    The string passed should be in the UTF-8 encoding. The sample "µm"
    is "\xC2\xB5m" in this encoding.

$us->utf7
$us->utf7( $newval )
    The string passed should be in the UTF-7 encoding. The sample "µm"
    is "+ALU-m" in this encoding.

    The UTF-7 encoding uses only plain US-ASCII characters. This makes
    it safe for transport through 8-bit stripping protocols.
    Characters outside the US-ASCII range are base64-encoded and '+'
    is used as an escape character. The UTF-7 encoding is described
    in RFC 1642.

    If the (global) variable
    $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a wider
    range of characters are encoded as themselves. It is TRUE by
    default. The characters affected by this are:

        ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }

$us->latin1
$us->latin1( $newval )
    The string passed should be in the ISO-8859-1 encoding. The sample
    "µm" is "\xB5m" in this encoding.

    Characters outside the "\x00" .. "\xFF" range are simply removed
    from the return value of the latin1() method. If you want more
    control over the mapping from Unicode to ISO-8859-1, use the
    "Unicode::Map8" class. This is also the way to deal with other
    8-bit character sets.
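The removal of characters outside the Latin-1 range can be sketched like this. The input (a euro sign followed by "m") is an illustrative choice, not taken from the module's documentation. The module may emit a warning for the dropped character when Perl's global warnings are enabled.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# "€m": U+20AC (outside Latin-1) followed by U+006D, given as UTF-8 bytes.
my $u = utf8("\xE2\x82\xACm");

# Per the description above, the euro sign has no ISO-8859-1 equivalent
# and is removed from the result, leaving only "m".
my $latin1 = $u->latin1;

print length($latin1), "\n";   # 1
print $latin1, "\n";           # m
```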

$us->hex
$us->hex( $newval )
    The string passed should be plain ASCII where each Unicode
    character is represented by a "U+XXXX" string, with characters
    separated by a single space. The "U+" prefix is optional when
    setting the value. The sample "µm" is "U+00b5 U+006d" in this
    encoding.
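The sample values quoted for each encoding above can be checked with a short script such as the following sketch. The %expected table simply repeats those quoted values. The final lines illustrate the get/set convention, where passing an argument returns the previous value.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $u = latin1("\xB5m");   # the "µm" sample, given as ISO-8859-1 bytes

# Expected encodings of "µm", copied from the method descriptions.
my %expected = (
    utf32be => "\0\0\0\xB5\0\0\0m",
    utf32le => "\xB5\0\0\0m\0\0\0",
    utf16be => "\0\xB5\0m",
    utf16le => "\xB5\0m\0",
    utf8    => "\xC2\xB5m",
    utf7    => "+ALU-m",
    latin1  => "\xB5m",
    hex     => "U+00b5 U+006d",
);

# Call each getter by name and compare with the documented value.
for my $enc (sort keys %expected) {
    my $got = $u->$enc;
    printf "%-8s %s\n", $enc, ($got eq $expected{$enc} ? "matches" : "differs");
}

# Passing an argument sets a new value and returns the previous one.
my $previous = $u->utf8("nm");
print "new value: ", $u->utf8, "\n";
```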

String Operations
The following methods are available:

$us->as_string
    Converts a "Unicode::String" to a plain string according to the
    setting of stringify_as(). The default stringify_as() encoding is
    "utf8".

$us->as_num
    Converts a "Unicode::String" to a number. Currently only the
    digits in the range 0x30 .. 0x39 are recognized. The plan is to
    eventually support all Unicode digit characters.

$us->as_bool
    Converts a "Unicode::String" to a boolean value. Only the empty
    string is FALSE. A string consisting of only the character U+0030
    is considered TRUE, even though Perl considers "0" to be FALSE.
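A minimal sketch of how these conversions show up in ordinary Perl expressions, assuming the boolean and numeric operator overloads call as_bool() and as_num() as described:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $zero  = utf8("0");
my $empty = utf8("");

# Boolean context: only the empty string is false.
print "object '0' is ", ($zero  ? "true" : "false"), "\n";   # true
print "empty is ",      ($empty ? "true" : "false"), "\n";   # false

# The plain Perl string "0" is false, unlike the object above.
print "plain '0' is ",  ("0"    ? "true" : "false"), "\n";   # false

# Numeric context: ASCII digits are converted to a number.
my $n   = utf8("42");
my $sum = $n + 1;
print "$sum\n";   # 43
```

Code that tests a "Unicode::String" for truth therefore behaves differently from code that tests the equivalent plain string when the value is "0".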

$us->repeat( $count )
    Returns a new "Unicode::String" where the content of $us is
    repeated $count times. This operation is also overloaded as:

        $us x $count

$us->concat( $other_string )
    Concatenates the string $us and the string $other_string. If
    $other_string is not a "Unicode::String" object, then it is first
    passed to the Unicode::String->new constructor function. This
    operation is also overloaded as:

        $us . $other_string

$us->append( $other_string )
    Appends the string $other_string to the value of $us. If
    $other_string is not a "Unicode::String" object, then it is first
    passed to the Unicode::String->new constructor function. This
    operation is also overloaded as:

        $us .= $other_string
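The three overloaded operators can be exercised together as in this sketch. The results are printed through the default UTF-8 stringification, and the plain string "cd" is decoded by new() as noted above.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $u = utf8("ab");

my $rep = $u x 3;      # same as $u->repeat(3)
my $cat = $u . "cd";   # same as $u->concat("cd")

print $rep, "\n";      # ababab
print $cat, "\n";      # abcd

# .= appends in place, so $u itself changes.
$u .= utf8("!");
print $u, "\n";        # ab!
```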

$us->copy
    Returns a copy of the current "Unicode::String" object. This
    operation is overloaded as the assignment operator.

$us->length
    Returns the length of the "Unicode::String" in 16-bit units. A
    surrogate pair still counts as 2.

$us->byteswap
    This method will swap the bytes in the internal representation of
    the "Unicode::String" object.

    Unicode reserves the character U+FEFF as a byte order mark. This
    works because the swapped character, U+FFFE, is reserved and never
    valid. For strings that have the byte order mark as the first
    character, we can guarantee to get the byte order right with the
    following code:

        $ustr->byteswap if $ustr->ord == 0xFFFE;
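To see the byte order check in action, the sketch below builds a string whose code units look byte-swapped and then repairs it. Using pack() to simulate wrongly loaded data is an assumption made for illustration. Real data would normally come from a file or socket.

```perl
use strict;
use warnings;
use Unicode::String;

# Simulate data loaded with the wrong byte order: the byte order mark
# U+FEFF followed by "A" (U+0041) appears as U+FFFE, U+4100.
my $ustr = Unicode::String->new;
$ustr->pack(0xFFFE, 0x4100);

# A leading U+FFFE can only be a swapped byte order mark, so swap back.
$ustr->byteswap if $ustr->ord == 0xFFFE;

# Prints U+FEFF, then U+0041.
printf "U+%04X\n", $_ for $ustr->unpack;
```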

$us->unpack
    Returns a list of integers, each representing a UCS-2 character
    code.

$us->pack( @uchr )
    Sets the value of $us to a sequence of UCS-2 characters with the
    character codes given as parameters.

$us->ord
    Returns the character code of the first character in $us. The
    ord() method deals with surrogate pairs, which gives a result
    range of 0x0 .. 0x10FFFF. If the $us string is empty, undef is
    returned.

$us->chr( $code )
    Sets the value of $us to be a string containing the character
    assigned code $code. The argument $code must be an integer in the
    range 0x0 .. 0x10FFFF. If the code is greater than 0xFFFF, a
    surrogate pair is created.
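The interaction between chr(), ord(), length() and unpack() for a character above 0xFFFF can be sketched as follows. The code point U+10400 is an arbitrary example. Its surrogate pair follows from the UTF-16 rules: 0x10400 - 0x10000 = 0x400, which gives the high unit 0xD800 + (0x400 >> 10) = 0xD801 and the low unit 0xDC00 + (0x400 & 0x3FF) = 0xDC00.

```perl
use strict;
use warnings;
use Unicode::String;

my $u = Unicode::String->new;
$u->chr(0x10400);            # a code point above 0xFFFF

my $code  = $u->ord;         # scalar context: surrogates are combined
my $len   = $u->length;      # counts 16-bit units
my @units = $u->unpack;      # the raw 16-bit code units

printf "ord    = U+%X\n", $code;                                       # U+10400
printf "length = %d\n",   $len;                                        # 2
printf "units  = %s\n",   join ' ', map { sprintf '%04X', $_ } @units; # D801 DC00
```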

$us->name
    In scalar context, returns the official Unicode name of the first
    character in $us. In array context, returns the names of all
    characters in $us. Also see Unicode::CharName.

$us->substr( $offset )
$us->substr( $offset, $length )
$us->substr( $offset, $length, $subst )
    Returns a sub-string of $us. Works similarly to the builtin
    substr() function.

$us->index( $other )
$us->index( $other, $pos )
    Locates the position of $other within $us, possibly starting the
    search at position $pos.

$us->chop
    Chops off the last character of $us and returns it (as a
    "Unicode::String" object).
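A short sketch of these three methods on an ASCII-only string. It assumes that substr() returns a "Unicode::String" object, as chop() is documented to do, and that positions are counted in characters rather than bytes.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

my $u = utf8("Unicode string");

# Positions and lengths are in characters (16-bit units), not bytes.
my $pos  = $u->index(utf8("string"));   # 8
my $head = $u->substr(0, 7);            # "Unicode"
my $tail = $u->substr($pos);            # "string"

# chop() removes the last character from $u and returns it.
my $last = $u->chop;                    # "g"

print "$pos\n";
print $head->utf8, " / ", $tail->utf8, " / ", $last->utf8, "\n";
print $u->utf8, "\n";                   # Unicode strin
```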

FUNCTIONS
The following functions are provided. None of these are exported by
default.

byteswap2( $str, ... )
    This function swaps the two bytes of each 2-byte unit in the
    strings passed as arguments. If it is called in void context, it
    modifies its arguments in place. Otherwise, the swapped strings
    are returned.

byteswap4( $str, ... )
    The byteswap4() function works like byteswap2(), but reverses the
    byte order within each 4-byte unit.
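Both calling styles can be sketched on plain byte strings. The inputs are illustrative UTF-16 and UTF-32 fragments. Note that the void-context form needs a modifiable variable rather than a string literal.

```perl
use strict;
use warnings;
use Unicode::String qw(byteswap2 byteswap4);

# Non-void context: the argument is left alone and a swapped copy returned.
my $utf16be = "\0s\0t";
my $utf16le = byteswap2($utf16be);   # "s\0t\0"

my $utf32be = "\0\0\0m";
my $utf32le = byteswap4($utf32be);   # "m\0\0\0"

# Void context: the argument itself is modified in place.
my $buf = "\0s\0t";
byteswap2($buf);                     # $buf is now "s\0t\0"

print unpack("H*", $_), "\n" for $utf16le, $utf32le, $buf;
```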

latin1( $str )
utf7( $str )
utf8( $str )
utf16le( $str )
utf16be( $str )
utf32le( $str )
utf32be( $str )
    Constructor functions for the various Unicode encodings. These
    return new "Unicode::String" objects. The provided argument should
    be encoded correspondingly.

uhex( $str )
    Constructs a new "Unicode::String" object from a string of hex
    values. See hex() method above for description of the format.
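As a sketch of how the constructor functions relate, the following builds the "µm" sample from three different input encodings and prints the UTF-8 bytes of each result in hexadecimal. All three lines should be identical.

```perl
use strict;
use warnings;
use Unicode::String qw(uhex latin1 utf16le);

# Three ways to build the same two-character string "µm".
my $from_hex     = uhex("U+00b5 U+006d");
my $from_latin1  = latin1("\xB5m");
my $from_utf16le = utf16le("\xB5\0m\0");

# Each object should produce the UTF-8 bytes C2 B5 6D.
for my $s ($from_hex, $from_latin1, $from_utf16le) {
    print unpack("H*", $s->utf8), "\n";   # c2b56d
}
```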

uchr( $num )
    Constructs a new one-character "Unicode::String" object from a
    Unicode character code. This works similarly to Perl's builtin
    chr() function.

SEE ALSO
Unicode::CharName, Unicode::Map8

<http://www.unicode.org/>

perlunicode

COPYRIGHT
Copyright 1997-2000,2005 Gisle Aas.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.38.0 2023-07-21 String(3)