Unicode::String(3pm)

String(3)             User Contributed Perl Documentation            String(3)

NAME

       Unicode::String - String of Unicode characters (UTF-16BE)

SYNOPSIS

        use Unicode::String qw(utf8 latin1 utf16be);

        $u = utf8("string");
        $u = latin1("string");
        $u = utf16be("\0s\0t\0r\0i\0n\0g");

        print $u->utf32be;   # 4 byte characters
        print $u->utf16le;   # 2 byte characters + surrogates
        print $u->utf8;      # 1-4 byte characters

DESCRIPTION

       A "Unicode::String" object represents a sequence of Unicode characters.
       Methods are provided to convert between various external formats
       (encodings) and "Unicode::String" objects, and methods are provided for
       common string manipulations.

       The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(),
       utf7(), latin1(), uhex() and uchr() can be imported from the
       "Unicode::String" module and work as constructors, initializing
       strings from data in the corresponding encoding.

       The "Unicode::String" objects overload various operators, which means
       that in most cases they can be treated like plain strings.

       Internally a "Unicode::String" object is represented by a string of
       2-byte numbers in network byte order (big-endian).  This representation
       is not exposed by the API, but it is useful to know when predicting
       the efficiency of the provided methods.

       METHODS

       Class methods

       The following class methods are available:

       Unicode::String->stringify_as
       Unicode::String->stringify_as( $enc )
           This method is used to specify which encoding will be used when
           "Unicode::String" objects are implicitly converted to and from
           plain strings.

           If an argument is provided, it sets the current encoding.  The
           argument should be one of the following: "ucs4", "utf32",
           "utf32be", "utf32le", "ucs2", "utf16", "utf16be", "utf16le",
           "utf8", "utf7", "latin1" or "hex".  The default is "utf8".

           The stringify_as() method returns a reference to the current
           encoding function.
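
           As a minimal sketch of the effect described above, assuming the
           module is installed, the snippet below switches the implicit
           encoding to "latin1" and then back to the documented default.
           Only the one-argument form is used, and the variable names are
           illustrative.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $um = latin1("\xB5m");                # "µm" given as ISO-8859-1 bytes

my $as_utf8 = "$um";                     # implicit conversion, default "utf8"

Unicode::String->stringify_as("latin1");
my $as_latin1 = "$um";                   # now stringified as ISO-8859-1

Unicode::String->stringify_as("utf8");   # restore the documented default
```

           Because stringify_as() is a class method, the setting is not tied
           to a single object, so restoring the previous encoding after a
           temporary change is prudent.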

       $us = Unicode::String->new
       $us = Unicode::String->new( $initial_value )
           This is the object constructor.  Without an argument, it creates
           an empty "Unicode::String" object.  If an $initial_value argument
           is given, it is decoded according to the encoding selected with
           stringify_as(), which is UTF-8 by default.

           In general it is recommended to import and use one of the
           encoding-specific constructor functions instead of invoking this
           method.
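
           A short sketch comparing the two forms, assuming the
           stringify_as() encoding is still the default "utf8"; the
           variable names are illustrative:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# Generic constructor: the argument is decoded using the current
# stringify_as() encoding (UTF-8 unless it has been changed).
my $from_new = Unicode::String->new("\xC2\xB5m");

# Encoding-specific constructor: the encoding is explicit at the call site.
my $from_func = utf8("\xC2\xB5m");
```

           The encoding-specific form does not depend on global state,
           which is one reason to prefer it.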

       Encoding methods

       These methods get or set the value of the "Unicode::String" object by
       passing strings in the corresponding encoding.  If a new value is
       passed as an argument, it sets the value of the "Unicode::String" and
       the previous value is returned.  If no argument is passed, the
       current value is returned.

       To illustrate the encodings, we show how the 2-character sample string
       "µm" (micrometer) is encoded in each one.

       $us->utf32be
       $us->utf32be( $newval )
           The string passed should be in the UTF-32 encoding with bytes in
           big-endian order.  The sample "µm" is "\0\0\0\xB5\0\0\0m" in this
           encoding.

           Alternative names for this method are utf32() and ucs4().

       $us->utf32le
       $us->utf32le( $newval )
           The string passed should be in the UTF-32 encoding with bytes in
           little-endian order.  The sample "µm" is "\xB5\0\0\0m\0\0\0" in
           this encoding.

       $us->utf16be
       $us->utf16be( $newval )
           The string passed should be in the UTF-16 encoding with bytes in
           big-endian order.  The sample "µm" is "\0\xB5\0m" in this encoding.

           Alternative names for this method are utf16() and ucs2().

           If the string passed to utf16be() starts with the Unicode byte
           order mark in little-endian order, the result is as if utf16le()
           was called instead.
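
           The byte order mark handling can be sketched as follows.  The
           input bytes are made up for illustration.  Whether the mark
           itself stays in the resulting string is not stated here, so the
           sketch only looks at the last two characters.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be utf16le);

# "µm" as little-endian UTF-16, read with the matching constructor.
my $plain = utf16le("\xB5\0m\0");

# The same data prefixed with a little-endian byte order mark (FF FE),
# handed to the big-endian constructor.
my $marked = utf16be("\xFF\xFE" . "\xB5\0m\0");

# Last two characters (four bytes) of each string in big-endian form.
my $plain_tail  = substr($plain->utf16be,  -4);
my $marked_tail = substr($marked->utf16be, -4);
```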

       $us->utf16le
       $us->utf16le( $newval )
           The string passed should be in the UTF-16 encoding with bytes in
           little-endian order.  The sample "µm" is "\xB5\0m\0" in this
           encoding.  This is the encoding used by the Microsoft Windows API.

           If the string passed to utf16le() starts with the Unicode byte
           order mark in big-endian order, the result is as if utf16be() was
           called instead.

       $us->utf8
       $us->utf8( $newval )
           The string passed should be in the UTF-8 encoding.  The sample
           "µm" is "\xC2\xB5m" in this encoding.

       $us->utf7
       $us->utf7( $newval )
           The string passed should be in the UTF-7 encoding.  The sample
           "µm" is "+ALU-m" in this encoding.

           The UTF-7 encoding uses only plain US-ASCII characters.  This
           makes it safe for transport through 8-bit stripping protocols.
           Characters outside the US-ASCII range are base64-encoded and '+'
           is used as an escape character.  The UTF-7 encoding is described
           in RFC 1642.

           If the (global) variable
           $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a wider
           range of characters are encoded as themselves.  It is TRUE by
           default.  The characters affected by this are:

              ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }

       $us->latin1
       $us->latin1( $newval )
           The string passed should be in the ISO-8859-1 encoding.  The
           sample "µm" is "\xB5m" in this encoding.

           Characters outside the "\x00" .. "\xFF" range are simply removed
           from the return value of the latin1() method.  If you want more
           control over the mapping from Unicode to ISO-8859-1, use the
           "Unicode::Map8" class.  This is also the way to deal with other
           8-bit character sets.
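
           A small sketch of the removal behaviour, using a made-up string
           that contains U+263A (which has no ISO-8859-1 equivalent)
           followed by "A":

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# U+263A followed by U+0041, given as big-endian UTF-16 bytes.
my $mixed = utf16be("\x26\x3A" . "\0A");

# Per the description above, the unrepresentable character is dropped
# rather than replaced with a substitute.
my $latin = $mixed->latin1;
```

           Since the loss is silent in the return value, comparing lengths
           before and after is one way to detect it.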

       $us->hex
       $us->hex( $newval )
           The string passed should be plain ASCII where each Unicode
           character is represented by the "U+XXXX" string and separated by
           a single space character.  The "U+" prefix is optional when
           setting the value.  The sample "µm" is "U+00b5 U+006d" in this
           encoding.
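
           The sample values quoted for each encoding method can be
           collected in one place.  This is only a sketch; the hash and
           variable names are illustrative, and the expected byte strings
           are the ones given in the method descriptions above.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $um = latin1("\xB5m");          # the "µm" sample

# Getter form: no argument, returns the value in that encoding.
my %encoded = (
    utf32be => $um->utf32be,       # "\0\0\0\xB5\0\0\0m"
    utf32le => $um->utf32le,       # "\xB5\0\0\0m\0\0\0"
    utf16be => $um->utf16be,       # "\0\xB5\0m"
    utf16le => $um->utf16le,       # "\xB5\0m\0"
    utf8    => $um->utf8,          # "\xC2\xB5m"
    utf7    => $um->utf7,          # "+ALU-m"
    latin1  => $um->latin1,        # "\xB5m"
    hex     => $um->hex,           # "U+00b5 U+006d"
);

# Setter form: stores a new value and returns the previous one,
# expressed in the encoding of the method that was called.
my $work     = $um->copy;
my $previous = $work->utf8("m");
```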

       String Operations

       The following methods are available:

       $us->as_string
           Converts a "Unicode::String" to a plain string according to the
           setting of stringify_as().  The default stringify_as() encoding is
           "utf8".

       $us->as_num
           Converts a "Unicode::String" to a number.  Currently only the
           digits in the range 0x30 .. 0x39 are recognized.  The plan is to
           eventually support all Unicode digit characters.

       $us->as_bool
           Converts a "Unicode::String" to a boolean value.  Only the empty
           string is FALSE.  A string consisting of only the character U+0030
           is considered TRUE, even though Perl considers "0" to be FALSE.
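
           The difference from plain Perl truthiness can be sketched like
           this (variable names are illustrative):

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $zero  = latin1("0");
my $empty = latin1("");
my $digits = latin1("42");

# A plain Perl "0" is false, but a Unicode::String holding U+0030 is not.
my $plain_zero_true = "0"             ? 1 : 0;
my $zero_true       = $zero->as_bool  ? 1 : 0;
my $empty_true      = $empty->as_bool ? 1 : 0;

# Numeric conversion of ASCII digits.
my $value = $digits->as_num;
```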

       $us->repeat( $count )
           Returns a new "Unicode::String" where the content of $us is
           repeated $count times.  This operation is also overloaded as:

             $us x $count

       $us->concat( $other_string )
           Concatenates the string $us and the string $other_string.  If
           $other_string is not a "Unicode::String" object, then it is first
           passed to the Unicode::String->new constructor function.  This
           operation is also overloaded as:

             $us . $other_string

       $us->append( $other_string )
           Appends the string $other_string to the value of $us.  If
           $other_string is not a "Unicode::String" object, then it is first
           passed to the Unicode::String->new constructor function.  This
           operation is also overloaded as:

             $us .= $other_string
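
           A brief sketch of the three overloaded operators, assuming the
           default stringify_as() encoding so that the plain ASCII
           operands are decoded as UTF-8:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $micro = latin1("\xB5");

my $um  = $micro . "m";      # concat: plain "m" goes through ->new first
my $row = $um x 3;           # repeat: new object, content repeated 3 times

my $word = latin1("a");
$word .= "b";                # append: modifies $word itself
```

           Plain-string operands containing non-ASCII bytes are decoded
           with whatever stringify_as() encoding is current, so wrapping
           them in an explicit constructor first avoids surprises.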

       $us->copy
           Returns a copy of the current "Unicode::String" object.  This
           operation is overloaded as the assignment operator.

       $us->length
           Returns the length of the "Unicode::String".  Surrogate pairs are
           still counted as 2.

       $us->byteswap
           This method will swap the bytes in the internal representation of
           the "Unicode::String" object.

           Unicode reserves the character U+FEFF as a byte order mark.  This
           works because the swapped character, U+FFFE, is reserved and never
           valid.  For strings that have the byte order mark as the first
           character, we can guarantee to get the byte order right with the
           following code:

              $ustr->byteswap if $ustr->ord == 0xFFFE;
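
           To see the check in action, the sketch below builds a string
           whose 16-bit units were stored with the wrong byte order.  It
           uses the pack() and unpack() methods described further down;
           the sample data (a swapped byte order mark followed by a
           swapped "a") is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::String;

# Simulate data read with the wrong byte order: 0xFFFE is the swapped
# byte order mark and 0x6100 is a swapped U+0061 ("a").
my $ustr = Unicode::String->new;
$ustr->pack(0xFFFE, 0x6100);

# The check from the text above: swap only when the mark is reversed.
$ustr->byteswap if $ustr->ord == 0xFFFE;

my @codes = $ustr->unpack;   # character codes after correction
```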

       $us->unpack
           Returns a list of integers, each representing a UCS-2 character
           code.

       $us->pack( @uchr )
           Sets the value of $us to a sequence of UCS-2 characters with the
           character codes given as parameters.
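
           A round trip through pack() and unpack(), sketched with the
           "µm" sample (U+00B5, U+006D):

```perl
use strict;
use warnings;
use Unicode::String;

my $us = Unicode::String->new;     # empty string object
$us->pack(0xB5, 0x6D);             # set value from character codes

my @codes = $us->unpack;           # read the codes back as integers
my $bytes = $us->latin1;           # same content as ISO-8859-1 bytes
```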

       $us->ord
           Returns the character code of the first character in $us.  The
           ord() method deals with surrogate pairs, which gives us a result
           range of 0x0 .. 0x10FFFF.  If the $us string is empty, undef is
           returned.

       $us->chr( $code )
           Sets the value of $us to be a string containing the character
           assigned code $code.  The argument $code must be an integer in the
           range 0x0 .. 0x10FFFF.  If the code is greater than 0xFFFF then a
           surrogate pair is created.
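
           The interaction between chr(), ord() and length() for a
           character above 0xFFFF can be sketched as follows.  U+1F600 is
           an arbitrary example; its UTF-16 surrogate pair is D83D DE00.
           The imported uchr() function is used as the constructor.

```perl
use strict;
use warnings;
use Unicode::String qw(uchr);

my $astral = uchr(0x1F600);        # code above 0xFFFF: surrogate pair

my $units = $astral->length;       # counts 16-bit units, not characters
my $code  = $astral->ord;          # surrogate pair combined back into one code
my $bytes = $astral->utf16be;      # the two surrogates as big-endian bytes
```

           Code that needs a count of characters rather than 16-bit units
           has to account for surrogate pairs itself.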

       $us->name
           In scalar context, returns the official Unicode name of the first
           character in $us.  In array context, returns the names of all
           characters in $us.  See also Unicode::CharName.

       $us->substr( $offset )
       $us->substr( $offset, $length )
       $us->substr( $offset, $length, $subst )
           Returns a substring of $us.  Works similarly to the builtin
           substr() function.

       $us->index( $other )
       $us->index( $other, $pos )
           Locates the position of $other within $us, optionally starting the
           search at position $pos.

       $us->chop
           Chops off the last character of $us and returns it (as a
           "Unicode::String" object).
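
           A short sketch of index(), substr() and chop() on a made-up
           ASCII string; positions and lengths are in characters, as with
           the builtin functions:

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $text = latin1("hello world");

my $pos  = $text->index("world");      # position of the substring
my $word = $text->substr($pos, 5);     # five characters from there
my $last = $text->chop;                # removes and returns the final "d"

my $remaining = $text->length;         # one character shorter after chop
```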

FUNCTIONS

       The following functions are provided.  None of these are exported by
       default.

       byteswap2( $str, ... )
           This function swaps the bytes of each 2-byte group in the strings
           passed as arguments.  If this function is called in void context,
           it modifies its arguments in place.  Otherwise, the swapped
           strings are returned.

       byteswap4( $str, ... )
           The byteswap4 function works similarly to byteswap2, but reverses
           the byte order of each 4-byte group.
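
           The two calling styles can be sketched as follows (the byte
           strings are made up for illustration):

```perl
use strict;
use warnings;
use Unicode::String qw(byteswap2 byteswap4);

# Non-void context: the arguments are left alone and swapped copies
# are returned.
my $swapped2 = byteswap2("\0a\0b");
my $swapped4 = byteswap4("\0\0\0a");

# Void context: the argument itself is modified, so it must be a
# variable rather than a literal.
my $buffer = "\0a\0b";
byteswap2($buffer);
```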

       latin1( $str )
       utf7( $str )
       utf8( $str )
       utf16le( $str )
       utf16be( $str )
       utf32le( $str )
       utf32be( $str )
           Constructor functions for the various Unicode encodings.  These
           return new "Unicode::String" objects.  The provided argument should
           be encoded correspondingly.

       uhex( $str )
           Constructs a new "Unicode::String" object from a string of hex
           values.  See the hex() method above for a description of the
           format.

       uchr( $num )
           Constructs a new one-character "Unicode::String" object from a
           Unicode character code.  This works similarly to Perl's builtin
           chr() function.
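
           A minimal sketch of the hex and character-code constructors,
           using the names listed as importable in the DESCRIPTION (uhex
           and uchr) and the "µm" sample:

```perl
use strict;
use warnings;
use Unicode::String qw(uhex uchr);

my $prefixed = uhex("U+00b5 U+006d");   # format produced by the hex() method
my $bare     = uhex("00b5 006d");       # "U+" prefix is optional on input
my $micro    = uchr(0xB5);              # single character from its code
```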

COPYRIGHT

       Copyright 1997-2000,2005 Gisle Aas.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.8.8                       2005-10-26                         String(3)
