Unicode::String(3pm)

String(3)             User Contributed Perl Documentation            String(3)

NAME

       Unicode::String - String of Unicode characters (UTF-16BE)

SYNOPSIS

        use Unicode::String qw(utf8 latin1 utf16be);

        $u = utf8("string");
        $u = latin1("string");
        $u = utf16be("\0s\0t\0r\0i\0n\0g");

        print $u->utf32be;   # 4 byte characters
        print $u->utf16le;   # 2 byte characters + surrogates
        print $u->utf8;      # 1-4 byte characters

DESCRIPTION

       A "Unicode::String" object represents a sequence of Unicode characters.
       Methods are provided to convert between various external formats
       (encodings) and "Unicode::String" objects, and methods are provided for
       common string manipulations.

       The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(),
       utf7(), latin1(), uhex(), uchr() can be imported from the
       "Unicode::String" module and will work as constructors initializing
       strings of the corresponding encoding.

       The "Unicode::String" objects overload various operators, which means
       that in most cases they can be treated like plain strings.

       Internally a "Unicode::String" object is represented by a string of
       2-byte numbers in network byte order (big-endian). This representation
       is not visible through the API, but it can be useful to know when
       predicting the efficiency of the provided methods.

   METHODS
   Class methods
       The following class methods are available:

       Unicode::String->stringify_as
       Unicode::String->stringify_as( $enc )
           This method is used to specify which encoding will be used when
           "Unicode::String" objects are implicitly converted to and from
           plain strings.

           If an argument is provided it sets the current encoding.  The
           argument should be one of the following: "ucs4", "utf32",
           "utf32be", "utf32le", "ucs2", "utf16", "utf16be", "utf16le",
           "utf8", "utf7", "latin1" or "hex".  The default is "utf8".

           The stringify_as() method returns a reference to the current
           encoding function.

       $us = Unicode::String->new
       $us = Unicode::String->new( $initial_value )
           This is the object constructor.  Without argument, it creates an
           empty "Unicode::String" object.  If an $initial_value argument is
           given, it is decoded according to the specified stringify_as()
           encoding, UTF-8 by default.

           In general it is recommended to import and use one of the encoding
           specific constructor functions instead of invoking this method.

   Encoding methods
       These methods get or set the value of the "Unicode::String" object by
       passing strings in the corresponding encoding.  If a new value is
       passed as argument it will set the value of the "Unicode::String", and
       the previous value is returned.  If no argument is passed then the
       current value is returned.

       To illustrate the encodings we show how the 2 character sample string
       "µm" (micrometer) is encoded in each one.
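
       The byte sequences quoted for each method below can be cross-checked
       without this module.  The following is a minimal sketch that assumes
       Perl's core Encode module; the encoding names passed to encode() are
       Encode's names, not Unicode::String method names.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The sample string "µm" as Perl characters: U+00B5 followed by U+006D.
my $sample = "\x{B5}m";

# Encode the sample once per encoding (names are those of the Encode module).
my %bytes = map { ( $_ => encode($_, $sample) ) }
    qw(UTF-32BE UTF-32LE UTF-16BE UTF-16LE UTF-8 iso-8859-1);

for my $enc (sort keys %bytes) {
    # Show every byte of the encoded form as two hex digits.
    my $hex = join " ", map { sprintf("%02X", ord($_)) } split(//, $bytes{$enc});
    printf "%-10s %s\n", $enc, $hex;
}
```

       Each output line lists one encoding followed by the bytes of the
       sample in hex, for example C2 B5 6D for UTF-8.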

       $us->utf32be
       $us->utf32be( $newval )
           The string passed should be in the UTF-32 encoding with bytes in
           big endian order.  The sample "µm" is "\0\0\0\xB5\0\0\0m" in this
           encoding.

           Alternative names for this method are utf32() and ucs4().

       $us->utf32le
       $us->utf32le( $newval )
           The string passed should be in the UTF-32 encoding with bytes in
           little endian order.  The sample "µm" is "\xB5\0\0\0m\0\0\0" in
           this encoding.

       $us->utf16be
       $us->utf16be( $newval )
           The string passed should be in the UTF-16 encoding with bytes in
           big endian order. The sample "µm" is "\0\xB5\0m" in this encoding.

           Alternative names for this method are utf16() and ucs2().

           If the string passed to utf16be() starts with the Unicode byte
           order mark in little endian order, the result is as if utf16le()
           was called instead.

       $us->utf16le
       $us->utf16le( $newval )
           The string passed should be in the UTF-16 encoding with bytes in
           little endian order.  The sample "µm" is "\xB5\0m\0" in this
           encoding.  This is the encoding used by the Microsoft Windows API.

           If the string passed to utf16le() starts with the Unicode byte
           order mark in big endian order, the result is as if utf16be() was
           called instead.

       $us->utf8
       $us->utf8( $newval )
           The string passed should be in the UTF-8 encoding. The sample "µm"
           is "\xC2\xB5m" in this encoding.

       $us->utf7
       $us->utf7( $newval )
           The string passed should be in the UTF-7 encoding. The sample "µm"
           is "+ALU-m" in this encoding.

           The UTF-7 encoding uses only plain US-ASCII characters.  This
           makes it safe for transport through 8-bit stripping protocols.
           Characters outside the US-ASCII range are base64-encoded and '+'
           is used as an escape character.  The UTF-7 encoding is described
           in RFC 1642.

           If the (global) variable
           $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a wider
           range of characters are encoded as themselves.  It is TRUE by
           default.  The characters affected by this are:

              ! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
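
           To see where "+ALU-m" comes from, the shifted part of a UTF-7
           string can be built by hand.  This is a rough sketch using the
           core Encode and MIME::Base64 modules; utf7_shifted_run() is a
           hypothetical helper, not part of Unicode::String, and it handles
           a single run of non-ASCII characters only.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: turn one run of non-ASCII characters into a
# UTF-7 shifted sequence.
sub utf7_shifted_run {
    my ($chars) = @_;
    # Base64 of the UTF-16BE bytes, with no line breaks added.
    my $b64 = encode_base64(encode("UTF-16BE", $chars), "");
    $b64 =~ s/=+\z//;           # UTF-7 omits the base64 padding
    return "+" . $b64 . "-";    # '+' opens the run, '-' closes it
}

# "µ" goes through the shifted form, "m" is plain ASCII and stays as it is.
my $utf7 = utf7_shifted_run("\x{B5}") . "m";
print "$utf7\n";    # prints +ALU-m
```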

       $us->latin1
       $us->latin1( $newval )
           The string passed should be in the ISO-8859-1 encoding. The sample
           "µm" is "\xB5m" in this encoding.

           Characters outside the "\x00" .. "\xFF" range are simply removed
           from the return value of the latin1() method.  If you want more
           control over the mapping from Unicode to ISO-8859-1, use the
           "Unicode::Map8" class.  This is also the way to deal with other
           8-bit character sets.
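
           The effect of dropping characters can be pictured with plain Perl
           lists.  The sketch below only models the rule described above on a
           list of code points; it does not call latin1() itself, and the
           sample code points are chosen for illustration.

```perl
use strict;
use warnings;

# Code points for "H", EURO SIGN, MICRO SIGN and "m".
my @codes = (0x48, 0x20AC, 0xB5, 0x6D);

# Keep only code points that fit in one ISO-8859-1 byte; the rest are dropped.
my $latin1 = join "", map { chr($_) } grep { $_ <= 0xFF } @codes;

printf "%d of %d characters kept\n", length($latin1), scalar @codes;
```

           Here the EURO SIGN (U+20AC) is silently lost, which is why a
           mapping class is suggested when that matters.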

       $us->hex
       $us->hex( $newval )
           The string passed should be plain ASCII where each Unicode
           character is represented by the "U+XXXX" string and separated by a
           single space character.  The "U+" prefix is optional when setting
           the value.  The sample "µm" is "U+00b5 U+006d" in this encoding.
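
           The format is simple enough to produce and parse by hand.  The
           following sketch assumes nothing beyond core Perl; to_hex_form()
           and from_hex_form() are hypothetical helpers that mirror the
           description above and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: format code points as "U+XXXX" words separated by spaces.
sub to_hex_form {
    return join " ", map { sprintf("U+%04x", $_) } @_;
}

# Hypothetical helper: parse the same format; the "U+" prefix may be omitted.
sub from_hex_form {
    my ($str) = @_;
    return map { my $word = $_; $word =~ s/^U\+//; hex($word) } split ' ', $str;
}

my $text  = to_hex_form(0xB5, 0x6D);          # the sample "µm"
my @codes = from_hex_form("U+00b5 006d");     # second word has no prefix
print "$text\n";                              # prints U+00b5 U+006d
```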

   String Operations
       The following methods are available:

       $us->as_string
           Converts a "Unicode::String" to a plain string according to the
           setting of stringify_as().  The default stringify_as() encoding is
           "utf8".

       $us->as_num
           Converts a "Unicode::String" to a number.  Currently only the
           digits in the range 0x30 .. 0x39 are recognized.  The plan is to
           eventually support all Unicode digit characters.

       $us->as_bool
           Converts a "Unicode::String" to a boolean value.  Only the empty
           string is FALSE.  A string consisting of only the character U+0030
           is considered TRUE, even though Perl considers "0" to be FALSE.

       $us->repeat( $count )
           Returns a new "Unicode::String" where the content of $us is
           repeated $count times.  This operation is also overloaded as:

             $us x $count

       $us->concat( $other_string )
           Concatenates the string $us and the string $other_string.  If
           $other_string is not a "Unicode::String" object, then it is first
           passed to the Unicode::String->new constructor function.  This
           operation is also overloaded as:

             $us . $other_string

       $us->append( $other_string )
           Appends the string $other_string to the value of $us.  If
           $other_string is not a "Unicode::String" object, then it is first
           passed to the Unicode::String->new constructor function.  This
           operation is also overloaded as:

             $us .= $other_string

       $us->copy
           Returns a copy of the current "Unicode::String" object.  This
           operation is overloaded as the assignment operator.

       $us->length
           Returns the length of the "Unicode::String".  A surrogate pair
           still counts as 2.

       $us->byteswap
           This method will swap the bytes in the internal representation of
           the "Unicode::String" object.

           Unicode reserves the character U+FEFF as a byte order mark.  This
           works because the swapped character, U+FFFE, is reserved to not be
           valid.  For strings that have the byte order mark as the first
           character, we can guarantee to get the byte order right with the
           following code:

              $ustr->byteswap if $ustr->ord == 0xFFFE;
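
           The same check can be applied to raw bytes before they ever reach
           a "Unicode::String" object.  This is a minimal sketch using only
           pack() and unpack(); normalize_utf16_be() is a hypothetical helper
           that assumes its input is UTF-16 data starting with a byte order
           mark.

```perl
use strict;
use warnings;

# Hypothetical helper: return UTF-16 data in big-endian order, using the
# byte order mark to decide whether a swap is needed.
sub normalize_utf16_be {
    my ($raw) = @_;
    my ($first) = unpack("n", $raw);    # first 16-bit unit, read big-endian
    if (defined $first && $first == 0xFFFE) {
        # U+FEFF shows up swapped, so the data is little-endian:
        # read each unit as little-endian and write it back big-endian.
        $raw = pack("n*", unpack("v*", $raw));
    }
    return $raw;
}

my $le = "\xFF\xFE\xB5\0m\0";           # byte order mark + "µm" in UTF-16LE
my $be = normalize_utf16_be($le);
print join(" ", map { sprintf("%02X", ord($_)) } split(//, $be)), "\n";
```

           Data that already starts with FE FF is returned unchanged.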

       $us->unpack
           Returns a list of integers each representing a UCS-2 character
           code.

       $us->pack( @uchr )
           Sets the value of $us as a sequence of UCS-2 characters with the
           character codes given as parameters.

       $us->ord
           Returns the character code of the first character in $us.  The
           ord() method deals with surrogate pairs, which gives us a result
           range of 0x0 .. 0x10FFFF.  If the $us string is empty, undef is
           returned.

       $us->chr( $code )
           Sets the value of $us to be a string containing the character
           assigned code $code.  The argument $code must be an integer in the
           range 0x0 .. 0x10FFFF.  If the code is greater than 0xFFFF then a
           surrogate pair is created.
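
           The surrogate pair arithmetic behind ord() and chr() is the
           standard UTF-16 rule.  It is sketched here in plain Perl as an
           illustration; to_surrogates() and from_surrogates() are
           hypothetical helpers, not functions of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point into UTF-16 code units.
sub to_surrogates {
    my ($code) = @_;
    return ($code) if $code <= 0xFFFF;      # fits in a single 16-bit unit
    my $v = $code - 0x10000;                # 20 bits remain
    return (0xD800 + ($v >> 10),            # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));         # low surrogate: bottom 10 bits
}

# Hypothetical helper: combine a high and a low surrogate again.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

my @pair = to_surrogates(0x1F600);
printf "%04X %04X\n", @pair;                # prints D83D DE00
```

           A character above 0xFFFF therefore occupies two 16-bit units,
           which is why length() counts it as 2.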

       $us->name
           In scalar context returns the official Unicode name of the first
           character in $us.  In array context returns the names of all
           characters in $us.  Also see Unicode::CharName.

       $us->substr( $offset )
       $us->substr( $offset, $length )
       $us->substr( $offset, $length, $subst )
           Returns a sub-string of $us.  Works similarly to the builtin
           substr() function.

       $us->index( $other )
       $us->index( $other, $pos )
           Locates the position of $other within $us, possibly starting the
           search at position $pos.

       $us->chop
           Chops off the last character of $us and returns it (as a
           "Unicode::String" object).

FUNCTIONS

       The following functions are provided.  None of these are exported by
       default.

       byteswap2( $str, ... )
           This function will swap each pair of bytes in the strings passed
           as arguments.  If this function is called in void context, then it
           will modify its arguments in-place.  Otherwise, the swapped strings
           are returned.

       byteswap4( $str, ... )
           The byteswap4 function works similarly to byteswap2, but reverses
           the order of the bytes in each group of 4.
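
           What these two functions do to the sample encodings can be
           modelled in a few lines of core Perl.  The sketch below is an
           approximation: swap_units() is a hypothetical helper, it always
           returns a new string, and it does not reproduce the in-place
           behaviour in void context.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the bytes inside each $size-byte group.
sub swap_units {
    my ($size, $str) = @_;
    # Cut the string into $size-byte chunks and reverse each chunk.
    return join "", map { scalar reverse $_ } unpack("(a$size)*", $str);
}

# "µm" in UTF-16BE becomes UTF-16LE after a 2-byte swap ...
my $utf16le = swap_units(2, "\0\xB5\0m");
# ... and UTF-32BE becomes UTF-32LE after a 4-byte swap.
my $utf32le = swap_units(4, "\0\0\0\xB5\0\0\0m");

print "utf16le matches\n" if $utf16le eq "\xB5\0m\0";
print "utf32le matches\n" if $utf32le eq "\xB5\0\0\0m\0\0\0";
```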

       latin1( $str )
       utf7( $str )
       utf8( $str )
       utf16le( $str )
       utf16be( $str )
       utf32le( $str )
       utf32be( $str )
           Constructor functions for the various Unicode encodings.  These
           return new "Unicode::String" objects.  The provided argument should
           be encoded correspondingly.

       uhex( $str )
           Constructs a new "Unicode::String" object from a string of hex
           values.  See the hex() method above for a description of the
           format.

       uchr( $num )
           Constructs a new one character "Unicode::String" object from a
           Unicode character code.  This works similarly to Perl's builtin
           chr() function.

COPYRIGHT

       Copyright 1997-2000,2005 Gisle Aas.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

POD ERRORS

       Hey! The above document had some coding errors, which are explained
       below:

       Around line 600:
           Non-ASCII character seen before =encoding in '"µm"'. Assuming UTF-8

perl v5.38.0                      2023-07-21                         String(3)
