Encode(3pm)

Encode(3pm)            Perl Programmers Reference Guide            Encode(3pm)

NAME

       Encode - character encodings


SYNOPSIS

           use Encode;

       Table of Contents

       Encode consists of a collection of modules whose details are too big to
       fit in one document.  This POD itself explains the top-level APIs and
       general topics at a glance.  For other topics and more details, see the
       PODs below:

         Name                  Description
         --------------------------------------------------------
         Encode::Alias         Alias definitions to encodings
         Encode::Encoding      Encode Implementation Base Class
         Encode::Supported     List of Supported Encodings
         Encode::CN            Simplified Chinese Encodings
         Encode::JP            Japanese Encodings
         Encode::KR            Korean Encodings
         Encode::TW            Traditional Chinese Encodings
         --------------------------------------------------------


DESCRIPTION

       The "Encode" module provides the interfaces between Perl's strings and
       the rest of the system.  Perl strings are sequences of characters.

       The repertoire of characters that Perl can represent is at least that
       defined by the Unicode Consortium.  On most platforms the ordinal value
       of a character (as returned by "ord(ch)") is the "Unicode codepoint"
       for that character (the exceptions are those platforms where the legacy
       encoding is some variant of EBCDIC rather than a super-set of ASCII -
       see perlebcdic).

       Traditionally, computer data has been moved around in 8-bit chunks
       often called "bytes".  These chunks are also known as "octets" in
       networking standards.  Perl is widely used to manipulate data of many
       types - not only strings of characters representing human or computer
       languages but also "binary" data, being the machine's representation
       of numbers, pixels in an image - or just about anything.

       When Perl is processing "binary data", the programmer wants Perl to
       process "sequences of bytes".  This is not a problem for Perl - as a
       byte has 256 possible values, it easily fits in Perl's much larger
       "logical character".

       TERMINOLOGY

       · character: a character in the range 0..(2**32-1) (or more).  (What
         Perl's strings are made of.)

       · byte: a character in the range 0..255.  (A special case of a Perl
         character.)

       · octet: 8 bits of data, with ordinal values 0..255.  (Term for bytes
         passed to or from a non-Perl context, e.g. a disk file.)

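       As a rough illustration of these terms, the sketch below compares
       the number of characters in a Perl string with the number of octets
       it occupies once encoded.  The variable names and the sample string
       are assumptions made for this example only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A Perl string of three characters; the middle one (U+263A) is above 255.
my $string = "a\x{263A}b";

# Encoding turns the characters into octets that can leave Perl.
my $octets = encode("UTF-8", $string);

my $n_chars  = length $string;    # number of characters
my $n_octets = length $octets;    # number of octets

print "$n_chars characters, $n_octets octets\n";   # 3 characters, 5 octets
```

       Every element of $octets is a character in the range 0..255, which
       is what makes it safe to write to a file or a socket.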

PERL ENCODING API

       $octets  = encode(ENCODING, $string [, CHECK])
         Encodes a string from Perl's internal form into ENCODING and returns
         a sequence of octets.  ENCODING can be either a canonical name or an
         alias.  For encoding names and aliases, see "Defining Aliases".  For
         CHECK, see "Handling Malformed Data".

         For example, to convert a string from Perl's internal format to
         iso-8859-1 (also known as Latin1):

           $octets = encode("iso-8859-1", $string);

         CAVEAT: When you run "$octets = encode("utf8", $string)", $octets
         may not be equal to $string.  Though they both contain the same
         data, the utf8 flag for $octets is always off.  When you encode
         anything, the utf8 flag of the result is always off, even when it
         contains a completely valid utf8 string.  See "The UTF-8 flag" below.

         If $string is "undef" then "undef" is returned.

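         A minimal sketch of encode() and of the CAVEAT above, assuming a
         short sample string of our own choosing.  It encodes the same
         string two ways and inspects the result with is_utf8(), which is
         described under "Messing with Perl's Internals".

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

my $string = "caf\x{e9}";                    # 4 characters; the last is U+00E9

my $latin1 = encode("iso-8859-1", $string);  # 1 octet per character here
my $utf8   = encode("utf8", $string);        # U+00E9 becomes 2 octets

my $flag_on = is_utf8($utf8) ? 1 : 0;        # the flag is off after encode
my $equal   = ($utf8 eq $string) ? 1 : 0;    # not equal for this input

printf "latin1: %d octets, utf8: %d octets, flag: %d, equal: %d\n",
    length($latin1), length($utf8), $flag_on, $equal;

my $nothing = encode("utf8", undef);         # undef in, undef out
```
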
       $string = decode(ENCODING, $octets [, CHECK])
         Decodes a sequence of octets assumed to be in ENCODING into Perl's
         internal form and returns the resulting string.  As in encode(),
         ENCODING can be either a canonical name or an alias.  For encoding
         names and aliases, see "Defining Aliases".  For CHECK, see "Handling
         Malformed Data".

         For example, to convert ISO-8859-1 data to a string in Perl's
         internal format:

           $string = decode("iso-8859-1", $octets);

         CAVEAT: When you run "$string = decode("utf8", $octets)", $string
         may not be equal to $octets.  Though they both contain the same
         data, the utf8 flag for $string is on unless $octets consists
         entirely of ASCII data (or EBCDIC on EBCDIC machines).  See "The
         UTF-8 flag" below.

         If $octets is "undef" then "undef" is returned.

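         The following sketch decodes a few octets, assuming sample data
         of our own choosing, and checks what came out.  The input
         contains a non-ASCII octet, so the utf8 flag of the result is
         expected to be on.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $octets = "caf\xe9";                        # ISO-8859-1 octets
my $string = decode("iso-8859-1", $octets);    # 4 characters

my $last_ord = ord substr($string, -1);        # code point of the last character
my $flag_on  = is_utf8($string) ? 1 : 0;       # on: the data is not ASCII-only

# The same call works for multi-byte encodings: three octets, one character.
my $hiragana = decode("UTF-8", "\xe3\x81\x82");

printf "%d characters, last is U+%04X, flag: %d\n",
    length($string), $last_ord, $flag_on;
```
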
       [$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])
         Converts data in place between two encodings.  The data in $octets
         must be encoded as octets and not as characters in Perl's internal
         format.  For example, to convert ISO-8859-1 data to Microsoft's
         CP1250 encoding:

           from_to($octets, "iso-8859-1", "cp1250");

         and to convert it back:

           from_to($octets, "cp1250", "iso-8859-1");

         Note that because the conversion happens in place, the data to be
         converted cannot be a string constant; it must be a scalar variable.

         from_to() returns the length of the converted string in octets on
         success, undef on error.

         CAVEAT: The following operations look the same but are not quite so:

           from_to($data, "iso-8859-1", "utf8"); #1
           $data = decode("iso-8859-1", $data);  #2

         Both #1 and #2 make $data consist of a completely valid UTF-8 string
         but only #2 turns the utf8 flag on.  #1 is equivalent to

           $data = encode("utf8", decode("iso-8859-1", $data));

         See "The UTF-8 flag" below.

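         A small sketch of the in-place conversion and its return value,
         using sample data chosen for this example.  Note that from_to()
         has to be requested explicitly in the import list.

```perl
use strict;
use warnings;
use Encode qw(from_to encode decode is_utf8);

my $data = "caf\xe9";                         # ISO-8859-1 octets, 4 of them
my $copy = $data;

# Convert in place; the return value is the new length in octets.
my $len = from_to($data, "iso-8859-1", "utf8");

# The equivalent two-step form, applied to a copy of the original octets.
my $equiv = encode("utf8", decode("iso-8859-1", $copy));

my $flag_on = is_utf8($data) ? 1 : 0;         # still octets, so the flag is off

print "converted to $len octets, flag: $flag_on\n";
```
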
       $octets = encode_utf8($string);
         Equivalent to "$octets = encode("utf8", $string);".  The characters
         that comprise $string are encoded in Perl's internal format and the
         result is returned as a sequence of octets.  All possible characters
         have a UTF-8 representation so this function cannot fail.

       $string = decode_utf8($octets [, CHECK]);
         Equivalent to "$string = decode("utf8", $octets [, CHECK])".  The
         sequence of octets represented by $octets is decoded from UTF-8 into
         a sequence of logical characters.  Not all sequences of octets form
         valid UTF-8 encodings, so it is possible for this call to fail.  For
         CHECK, see "Handling Malformed Data".

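         As a sketch of the two shortcuts, the example below round-trips
         one character and then feeds decode_utf8() an octet that cannot
         occur in UTF-8.  The sample data is an assumption of this
         example; Encode::FB_CROAK is explained under "Handling Malformed
         Data".

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $string = "\x{3042}";                 # one character
my $octets = encode_utf8($string);       # three octets
my $back   = decode_utf8($octets);       # one character again

# 0xFF never appears in UTF-8, so decoding with FB_CROAK dies here.
my $bad    = "abc\xff";
my $result = eval { decode_utf8($bad, Encode::FB_CROAK) };
my $failed = defined $result ? 0 : 1;

printf "%d octets, round trip %s, malformed input %s\n",
    length($octets),
    ($back eq $string ? "ok" : "broken"),
    ($failed ? "rejected" : "accepted");
```
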
       Listing available encodings

         use Encode;
         @list = Encode->encodings();

       Returns a list of the canonical names of the available encodings that
       are loaded.  To get a list of all available encodings including the
       ones that are not loaded yet, say

         @all_encodings = Encode->encodings(":all");

       Or you can give the name of a specific module.

         @with_jp = Encode->encodings("Encode::JP");

       When "::" is not in the name, "Encode::" is assumed.

         @ebcdic = Encode->encodings("EBCDIC");

       To find out in detail which encodings are supported by this package,
       see Encode::Supported.

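       The calls above can be combined, for instance, to see which
       encodings are known but not loaded yet.  This is only a sketch; the
       helper variables are ours, and the exact lists depend on the
       installed Encode.

```perl
use strict;
use warnings;
use Encode;

my @loaded  = Encode->encodings();              # encodings loaded so far
my @all     = Encode->encodings(":all");        # loaded or not
my @with_jp = Encode->encodings("Encode::JP");  # loaded ones plus Encode::JP's

# Encodings that are known but not loaded yet.
my %is_loaded  = map { $_ => 1 } @loaded;
my @not_loaded = grep { !$is_loaded{$_} } @all;

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;
print "some not yet loaded: @not_loaded[0 .. 2]\n" if @not_loaded > 2;
```
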
       Defining Aliases

       To add a new alias to a given encoding, use:

         use Encode;
         use Encode::Alias;
         define_alias(newName => ENCODING);

       After that, newName can be used as an alias for ENCODING.  ENCODING may
       be either the name of an encoding or an encoding object.

       Before you do so, use "resolve_alias()" to make sure the alias does not
       already exist; it returns the canonical name of the encoding that an
       alias resolves to.  For example:

         Encode::resolve_alias("latin1") eq "iso-8859-1" # true
         Encode::resolve_alias("iso-8859-12")   # false; nonexistent
         Encode::resolve_alias($name) eq $name  # true if $name is canonical

       resolve_alias() does not need "use Encode::Alias"; it can be exported
       via "use Encode qw(resolve_alias)".

       See Encode::Alias for details.

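       Put together, checking for and then defining an alias might look
       like the sketch below.  The alias name "my-latin" is hypothetical
       and chosen only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);
use Encode::Alias;                        # exports define_alias

my $alias = "my-latin";                   # hypothetical alias name

my $before = resolve_alias($alias);       # false: nothing by that name yet
define_alias($alias => "iso-8859-1") unless $before;
my $after  = resolve_alias($alias);       # canonical name of the target

my $octets = encode($alias, "\x{e9}");    # the alias now works like the name

print "$alias resolves to $after\n";
```
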

Encoding via PerlIO

       If your perl supports PerlIO (which is the default), you can use a
       PerlIO layer to decode and encode directly via a filehandle.  The
       following two examples are equivalent in functionality.

         # via PerlIO
         open my $in,  "<:encoding(shiftjis)", $infile  or die;
         open my $out, ">:encoding(euc-jp)",   $outfile or die;
         while(<$in>){ print $out $_; }

         # via from_to
         open my $in,  "<", $infile  or die;
         open my $out, ">", $outfile or die;
         while(<$in>){
           from_to($_, "shiftjis", "euc-jp", 1);
           print $out $_;
         }

       Unfortunately, not every encoding is PerlIO-savvy.  You can check
       whether your encoding is supported by PerlIO by calling the
       "perlio_ok" method.

         Encode::perlio_ok("hz");             # False
         find_encoding("euc-cn")->perlio_ok;  # True where PerlIO is available

         use Encode qw(perlio_ok);            # exported upon request
         perlio_ok("euc-jp")

       Fortunately, all encodings that come with Encode core are PerlIO-savvy
       except for hz and ISO-2022-kr.  For gory details, see Encode::Encoding
       and Encode::PerlIO.

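       The PerlIO variant can be tried without touching the disk.  The
       sketch below assumes in-memory filehandles (scalar references in
       place of file names), which is our substitution for the $infile
       and $outfile paths used above, and compares the output with a
       direct call to encode().

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "\x{3042}\x{3044}\n";            # two hiragana characters

# In-memory "files" (an assumption made for this sketch) instead of paths.
my $infile  = encode("shiftjis", $text);    # Shift_JIS octets to read
my $outfile = '';                           # receives EUC-JP octets

open my $in,  "<:encoding(shiftjis)", \$infile  or die "open in: $!";
open my $out, ">:encoding(euc-jp)",   \$outfile or die "open out: $!";
while (my $line = <$in>) { print {$out} $line }
close $out or die "close out: $!";
close $in;

my $expected = encode("euc-jp", $text);
print $outfile eq $expected ? "converted as expected\n" : "mismatch\n";
```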

Handling Malformed Data

       The optional CHECK argument tells Encode what to do when it encounters
       malformed data.  Without CHECK, Encode::FB_DEFAULT ( == 0 ) is assumed.

       As of version 2.12 Encode supports coderef values for CHECK.  See
       "coderef for CHECK" below.

       NOTE: Not all encodings support this feature
         Some encodings ignore the CHECK argument.  For example,
         Encode::Unicode ignores CHECK and always croaks on error.

       Here is the list of available CHECK values:

       CHECK = Encode::FB_DEFAULT ( == 0)
         If CHECK is 0, (en|de)code will put a substitution character in place
         of a malformed character.  When you encode, <subchar> will be used.
         When you decode, the code point 0xFFFD is used.  If the data is
         supposed to be UTF-8, an optional lexical warning (category utf8) is
         given.

       CHECK = Encode::FB_CROAK ( == 1)
         If CHECK is 1, methods will die on error immediately with an error
         message.  Therefore, when CHECK is set to 1, you should trap the
         error with eval{} unless you really want to let it die.

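         To contrast the two modes described so far, here is a sketch
         using the "ascii" encoding and sample data of our own choosing.
         It shows the substitution characters of FB_DEFAULT and the
         eval{} trap recommended for FB_CROAK.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "A\x{263A}";         # U+263A has no ASCII equivalent
my $octets = "A\xe9";             # 0xE9 is not valid ASCII

# FB_DEFAULT (no CHECK given): substitution characters are used.
my $encoded = encode("ascii", $string);    # "?" stands in for U+263A
my $decoded = decode("ascii", $octets);    # U+FFFD stands in for 0xE9

# FB_CROAK: the call dies, so trap it with eval.
my $result = eval { encode("ascii", $string, Encode::FB_CROAK) };
my $error  = $@;

print "default: $encoded\n";
print "croak:   $error" if !defined $result;
```
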
       CHECK = Encode::FB_QUIET
         If CHECK is set to Encode::FB_QUIET, (en|de)code will immediately
         return the portion of the data that has been processed so far when an
         error occurs.  The data argument will be overwritten with everything
         after that point (that is, the unprocessed part of the data).  This
         is handy when you have to call decode repeatedly because your source
         data may contain partial multi-byte character sequences (e.g. you
         are reading with a fixed-width buffer).  Here is sample code that
         does exactly this:

           my $buffer = ''; my $string = '';
           while(read $fh, $buffer, 256, length($buffer)){
             $string .= decode($encoding, $buffer, Encode::FB_QUIET);
             # $buffer now contains the unprocessed partial character
           }

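         The same loop can be exercised end to end.  The sketch below
         assumes an in-memory filehandle and a deliberately tiny buffer
         of 4 octets (both are choices made for this example) so that
         every read splits a 3-octet character.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Nine octets: three characters of three octets each.
my $source = encode("UTF-8", "\x{3042}\x{3044}\x{3046}");
open my $fh, "<:raw", \$source or die "open: $!";

my ($buffer, $string) = ('', '');
while (read $fh, $buffer, 4, length($buffer)) {
    # Decode what is complete; the partial tail stays in $buffer.
    $string .= decode("UTF-8", $buffer, Encode::FB_QUIET);
}
close $fh;

printf "%d characters decoded, %d octets left over\n",
    length($string), length($buffer);
```

         Passing the current length of $buffer as the offset to read()
         appends new octets after the leftover partial character instead
         of overwriting it.
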
       CHECK = Encode::FB_WARN
         This is the same as FB_QUIET above, except that it warns on error.
         Handy when you are debugging the mode above.

       perlqq mode (CHECK = Encode::FB_PERLQQ)
       HTML charref mode (CHECK = Encode::FB_HTMLCREF)
       XML charref mode (CHECK = Encode::FB_XMLCREF)
         For encodings that are implemented by Encode::XS, CHECK ==
         Encode::FB_PERLQQ turns (en|de)code into "perlqq" fallback mode.

         When you decode, "\xHH" will be inserted for a malformed character,
         where HH is the hex representation of the octet that could not be
         decoded to utf8.  When you encode, "\x{HHHH}" will be inserted,
         where HHHH is the Unicode ID of the character that cannot be found in
         the character repertoire of the encoding.

         The HTML/XML character reference modes are about the same.  In place
         of "\x{HHHH}", HTML uses "&#NNN;" where NNN is a decimal number and
         XML uses "&#xHHHH;" where HHHH is the hexadecimal number.

         In Encode 2.10 or later, "LEAVE_SRC" is also implied.

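         A short sketch of the three fallback modes, assuming the "ascii"
         encoding and a sample character of our own choosing.  The
         results are printed rather than hard-coded, since details such
         as the letter case of the hex digits are up to Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "A\x{263A}";          # U+263A cannot be encoded as ASCII
my $octets = "A\xe9";              # 0xE9 cannot be decoded as ASCII

my $perlqq  = encode("ascii", $string, Encode::FB_PERLQQ);
my $html    = encode("ascii", $string, Encode::FB_HTMLCREF);
my $xml     = encode("ascii", $string, Encode::FB_XMLCREF);
my $decoded = decode("ascii", $octets, Encode::FB_PERLQQ);

print "perlqq : $perlqq\n";        # the \x{HHHH} form
print "html   : $html\n";          # decimal character reference
print "xml    : $xml\n";           # hexadecimal character reference
print "decode : $decoded\n";       # the \xHH form
```
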
       The bitmask
         These modes are actually set via a bitmask.  Here is how the FB_XX
         constants are laid out.  You can import the FB_XX constants via "use
         Encode qw(:fallbacks)"; you can import the generic bitmask constants
         via "use Encode qw(:fallback_all)".

                              FB_DEFAULT FB_CROAK FB_QUIET FB_WARN  FB_PERLQQ
          DIE_ON_ERR    0x0001             X
          WARN_ON_ERR   0x0002                               X
          RETURN_ON_ERR 0x0004                      X        X
          LEAVE_SRC     0x0008                                        X
          PERLQQ        0x0100                                        X
          HTMLCREF      0x0200
          XMLCREF       0x0400

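         Because the modes are bits, they can be combined.  As a sketch,
         assuming sample data of our own, the code below adds LEAVE_SRC
         to FB_QUIET so that the source is not overwritten, and compares
         that with plain FB_QUIET.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "ok\xff";                  # 0xFF can never appear in UTF-8
my $kept   = $octets;
my $eaten  = $octets;

# RETURN_ON_ERR plus LEAVE_SRC: stop quietly, leave the source alone.
my $check = Encode::FB_QUIET() | Encode::LEAVE_SRC();
my $first = decode("UTF-8", $kept, $check);

# FB_QUIET alone: the source is overwritten with the unprocessed part.
my $second = decode("UTF-8", $eaten, Encode::FB_QUIET());

printf "kept %d octet(s), eaten down to %d octet(s)\n",
    length($kept), length($eaten);
```
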
       coderef for CHECK

       As of Encode 2.12 CHECK can also be a code reference which takes the
       ord value of the unmapped character as an argument and returns a string
       that represents the fallback character.  For instance,

         $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });

       acts like FB_PERLQQ but <U+XXXX> is used instead of \x{XXXX}.

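       Here is the same idea as a runnable sketch, with a sample string of
       our own that contains two characters outside ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $utf8 = "caf\x{e9} \x{263A}";    # two characters that ASCII lacks

# The callback receives the ordinal of each unmapped character.
my $ascii = encode("ascii", $utf8, sub { sprintf "<U+%04X>", shift });

print "$ascii\n";                   # caf<U+00E9> <U+263A>
```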

Defining Encodings

       To define a new encoding, use:

           use Encode qw(define_encoding);
           define_encoding($object, 'canonicalName' [, alias...]);

       canonicalName will be associated with $object.  The object should
       provide the interface described in Encode::Encoding.  If more than two
       arguments are provided then additional arguments are taken as aliases
       for $object.

       See Encode::Encoding for more details.

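       As a toy illustration only, the sketch below registers a ROT13
       "encoding".  The class name My::Rot13, the names "x-rot13" and
       "rot13-demo", and the use of a "Name" hash key (which is what the
       name() method inherited from Encode::Encoding is assumed to read)
       are choices made for this example; a real encoding should follow
       Encode::Encoding more closely, including its handling of CHECK.

```perl
use strict;
use warnings;
use Encode qw(encode decode define_encoding find_encoding);
use Encode::Encoding;

{
    package My::Rot13;                      # hypothetical class name
    our @ISA = ('Encode::Encoding');        # inherit name() and friends

    # ROT13 is its own inverse, so one routine serves both directions.
    sub encode {
        my ($self, $string, $check) = @_;
        (my $out = $string) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        return $out;
    }
    sub decode { my $self = shift; return $self->encode(@_) }
}

my $object = bless { Name => "x-rot13" }, "My::Rot13";
define_encoding($object, "x-rot13", "rot13-demo");

my $octets = encode("x-rot13", "Hello");
my $string = decode("rot13-demo", $octets);      # via the alias
my $name   = find_encoding("rot13-demo")->name;  # canonical name

print "$octets -> $string ($name)\n";            # Uryyb -> Hello (x-rot13)
```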

The UTF-8 flag

       Before the introduction of utf8 support in perl, the "eq" operator just
       compared the strings represented by two scalars.  Beginning with perl
       5.8, "eq" compares two strings with simultaneous consideration of the
       utf8 flag.  To explain why we made it so, I will quote page 402 of
       "Programming Perl, 3rd ed."

       Goal #1:
         Old byte-oriented programs should not spontaneously break on the old
         byte-oriented data they used to work on.

       Goal #2:
         Old byte-oriented programs should magically start working on the new
         character-oriented data when appropriate.

       Goal #3:
         Programs should run just as fast in the new character-oriented mode
         as in the old byte-oriented mode.

       Goal #4:
         Perl should remain one language, rather than forking into a
         byte-oriented Perl and a character-oriented Perl.

       Back when "Programming Perl, 3rd ed." was written, not even Perl 5.6.0
       had been released, and many features documented in the book remained
       unimplemented for a long time.  Perl 5.8 corrected this, and the
       introduction of the UTF-8 flag is one of those corrections.  You can
       think of this as perl having a byte-oriented mode (utf8 flag off) and
       a character-oriented mode (utf8 flag on).

       Here is how Encode takes care of the utf8 flag.

       · When you encode, the resulting utf8 flag is always off.

       · When you decode, the resulting utf8 flag is on unless the data can
         be represented unambiguously without it.  Here is what that means.

         After "$utf8 = decode('foo', $octet);",

           When $octet is...   The utf8 flag in $utf8 is
           ---------------------------------------------
           In ASCII only (or EBCDIC only)            OFF
           In ISO-8859-1                              ON
           In any other Encoding                      ON
           ---------------------------------------------

         As you can see, there is one exception: ASCII-only data.  That way
         you can assume Goal #1.  With Encode, Goal #2 is assumed as well,
         but you still have to be careful in the cases mentioned in the
         CAVEAT paragraphs above.

         This utf8 flag is not visible in perl scripts, for exactly the same
         reason you cannot (or you don't have to) see if a scalar contains a
         string, integer, or floating point number.  But you can still peek
         and poke these if you will.  See the section below.

       Messing with Perl's Internals

       The following functions use parts of Perl's internals in the current
       implementation.  As such, they are efficient but may change.

       is_utf8(STRING [, CHECK])
         [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
         If CHECK is true, also checks the data in STRING for being
         well-formed UTF-8.  Returns true if successful, false otherwise.

         As of perl 5.8.1, utf8 also has utf8::is_utf8().

       _utf8_on(STRING)
         [INTERNAL] Turns on the UTF-8 flag in STRING.  The data in STRING is
         not checked for being well-formed UTF-8.  Do not use unless you know
         that the STRING is well-formed UTF-8.  Returns the previous state of
         the UTF-8 flag (so please don't treat the return value as indicating
         success or failure), or "undef" if STRING is not a string.

       _utf8_off(STRING)
         [INTERNAL] Turns off the UTF-8 flag in STRING.  Do not use
         frivolously.  Returns the previous state of the UTF-8 flag (so please
         don't treat the return value as indicating success or failure), or
         "undef" if STRING is not a string.

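       For illustration only, the sketch below flips the flag on three
       octets that happen to be well-formed UTF-8 and watches how
       length() and ord() change.  The sample data is ours; in ordinary
       code decode() is the right tool for this.

```perl
use strict;
use warnings;
use Encode qw(is_utf8 _utf8_on _utf8_off);

my $data = "\xe3\x81\x82";                  # the UTF-8 octets for U+3042

my $before   = is_utf8($data) ? 1 : 0;      # 0: plain octets so far
my $previous = _utf8_on($data);             # returns the previous flag state
my $chars_on = length $data;                # counted as characters now
my $code     = ord $data;                   # code point of the first character
my $valid    = is_utf8($data, 1) ? 1 : 0;   # flag on and data well-formed

_utf8_off($data);
my $chars_off = length $data;               # counted as octets again

printf "flag on: %d char (U+%04X), flag off: %d chars\n",
    $chars_on, $code, $chars_off;
```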

UTF-8 vs. utf8

         ....We now view strings not as sequences of bytes, but as sequences
         of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
         computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

       That has been perl's notion of UTF-8, but official UTF-8 is more
       strict: its range is much narrower (0 .. 10FFFF), and some sequences
       are not allowed (e.g. those used in surrogate pairs, 0xFFFE, et al).

       Now that is overruled by Larry Wall himself.

         From: Larry Wall <larry@wall.org>
         Date: December 04, 2004 11:51:58 JST
         To: perl-unicode@perl.org
         Subject: Re: Make Encode.pm support the real UTF-8
         Message-Id: <20041204025158.GA28754@wall.org>

         On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
         : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
         : but "UTF-8" is the name of the standard and should give the
         : corresponding behaviour.

         For what it's worth, that's how I've always kept them straight in my
         head.

         Also for what it's worth, Perl 6 will mostly default to strict but
         make it easy to switch back to lax.

         Larry

       Do you copy?  As of Perl 5.8.7, "UTF-8" means strict, official UTF-8
       while "utf8" means the liberal, lax version thereof.  Encode version
       2.10 or later thus groks the difference between "UTF-8" and "utf8".

         encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
         encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

       "UTF-8" in Encode is actually a canonical name for "utf-8-strict".
       Yes, the hyphen between "UTF" and "8" is important.  Without it Encode
       goes "liberal".

         find_encoding("UTF-8")->name # is 'utf-8-strict'
         find_encoding("utf-8")->name # ditto. names are case insensitive
         find_encoding("utf_8")->name # ditto. "_" are treated as "-"
         find_encoding("UTF8")->name  # is 'utf8'.

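       The difference can be observed with a code point just above the
       Unicode maximum.  In this sketch, chr(0x110000) is our choice of
       sample data, and LEAVE_SRC is added to the CHECK value so that the
       first call does not overwrite the variable used by the second.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $beyond = chr(0x110000);     # one past the Unicode maximum of 0x10FFFF
my $check  = Encode::FB_CROAK() | Encode::LEAVE_SRC();

my $lax    = eval { encode("utf8",  $beyond, $check) };   # accepted
my $strict = eval { encode("UTF-8", $beyond, $check) };   # dies
my $strict_error = $@;

my @names = map { find_encoding($_)->name } qw(UTF-8 utf-8 utf_8 UTF8);

print "lax:    ", (defined $lax    ? "ok" : "croaked"), "\n";
print "strict: ", (defined $strict ? "ok" : "croaked"), "\n";
print "names:  @names\n";
```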

MAINTAINER

       This project was originated by Nick Ing-Simmons and later maintained by
       Dan Kogai <dankogai@dan.co.jp>.  See AUTHORS for a full list of people
       involved.  For any questions, use <perl-unicode@perl.org> so we can all
       share.

perl v5.8.8                       2001-09-21                       Encode(3pm)