Encode(3pm)

1Encode(3)             User Contributed Perl Documentation            Encode(3)
2
3
4

NAME

6       Encode - character encodings in Perl
7

SYNOPSIS

9           use Encode qw(decode encode);
10           $characters = decode('UTF-8', $octets,     Encode::FB_CROAK);
11           $octets     = encode('UTF-8', $characters, Encode::FB_CROAK);
12
13   Table of Contents
14       Encode consists of a collection of modules whose details are too
15       extensive to fit in one document.  This one itself explains the top-
16       level APIs and general topics at a glance.  For other topics and more
17       details, see the documentation for these modules:
18
19       Encode::Alias - Alias definitions to encodings
20       Encode::Encoding - Encode Implementation Base Class
21       Encode::Supported - List of Supported Encodings
22       Encode::CN - Simplified Chinese Encodings
23       Encode::JP - Japanese Encodings
24       Encode::KR - Korean Encodings
25       Encode::TW - Traditional Chinese Encodings
26

DESCRIPTION

28       The "Encode" module provides the interface between Perl strings and the
29       rest of the system.  Perl strings are sequences of characters.
30
31       The repertoire of characters that Perl can represent is a superset of
32       those defined by the Unicode Consortium. On most platforms the ordinal
33       value of a character as returned by "ord(S)" is the Unicode codepoint
34       for that character. The exceptions are platforms where the legacy
35       encoding is some variant of EBCDIC rather than a superset of ASCII; see
36       perlebcdic.
37
38       During recent history, data is moved around a computer in 8-bit chunks,
39       often called "bytes" but also known as "octets" in standards documents.
40       Perl is widely used to manipulate data of many types: not only strings
41       of characters representing human or computer languages, but also
42       "binary" data, being the machine's representation of numbers, pixels in
43       an image, or just about anything.
44
45       When Perl is processing "binary data", the programmer wants Perl to
46       process "sequences of bytes". This is not a problem for Perl: because a
47       byte has 256 possible values, it easily fits in Perl's much larger
48       "logical character".
49
50       This document mostly explains the how. perlunitut and perlunifaq
51       explain the why.
52
53   TERMINOLOGY
54       character
55
56       A character in the range 0 .. 2**32-1 (or more); what Perl's strings
57       are made of.
58
59       byte
60
61       A character in the range 0..255; a special case of a Perl character.
62
63       octet
64
65       8 bits of data, with ordinal values 0..255; term for bytes passed to or
66       from a non-Perl context, such as a disk file, standard I/O stream,
67       database, command-line argument, environment variable, socket etc.
68

THE PERL ENCODING API

70   Basic methods
71       encode
72
73         $octets  = encode(ENCODING, STRING[, CHECK])
74
75       Encodes the scalar value STRING from Perl's internal form into ENCODING
76       and returns a sequence of octets.  ENCODING can be either a canonical
77       name or an alias.  For encoding names and aliases, see "Defining
78       Aliases".  For CHECK, see "Handling Malformed Data".
79
80       For example, to convert a string from Perl's internal format into
81       ISO-8859-1, also known as Latin1:
82
83         $octets = encode("iso-8859-1", $string);
84
85       CAVEAT: When you run "$octets = encode("utf8", $string)", then $octets
86       might not be equal to $string.  Though both contain the same data, the
87       UTF8 flag for $octets is always off.  When you encode anything, the
88       UTF8 flag on the result is always off, even when it contains a
89       completely valid utf8 string. See "The UTF8 flag" below.
90
91       If the $string is "undef", then "undef" is returned.
92
93       decode
94
95         $string = decode(ENCODING, OCTETS[, CHECK])
96
97       This function returns the string that results from decoding the scalar
98       value OCTETS, assumed to be a sequence of octets in ENCODING, into
99       Perl's internal form.  As with
100       encode(), ENCODING can be either a canonical name or an alias. For
101       encoding names and aliases, see "Defining Aliases"; for CHECK, see
102       "Handling Malformed Data".
103
104       For example, to convert ISO-8859-1 data into a string in Perl's
105       internal format:
106
107         $string = decode("iso-8859-1", $octets);
108
109       CAVEAT: When you run "$string = decode("utf8", $octets)", then $string
110       might not be equal to $octets.  Though both contain the same data, the
111       UTF8 flag for $string is on unless $octets consists entirely of ASCII
112       data on ASCII machines or EBCDIC on EBCDIC machines.  See "The UTF8
113       flag" below.
114
115       If $octets is "undef", then "undef" is returned.
116
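       The round trip described above can be sketched as follows.  This is
       only a minimal illustration; the variable names and the sample string
       are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two characters: LATIN SMALL LETTER E WITH ACUTE and EURO SIGN.
my $characters = "\x{E9}\x{20AC}";

# Characters -> octets.  In UTF-8 these take 2 and 3 octets respectively.
my $octets = encode('UTF-8', $characters);

# Octets -> characters.
my $round_trip = decode('UTF-8', $octets);

printf "%d characters, %d octets\n", length($characters), length($octets);
print "round trip preserved the string\n" if $round_trip eq $characters;
```

       The character count and the octet count differ, which is why the two
       forms should not be mixed in the same variable.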
117       find_encoding
118
119         [$obj =] find_encoding(ENCODING)
120
121       Returns the encoding object corresponding to ENCODING.  Returns "undef"
122       if no matching ENCODING is found.  The returned object is what does the
123       actual encoding or decoding.
124
125         $utf8 = decode($name, $bytes);
126
127       is in fact
128
129           $utf8 = do {
130               $obj = find_encoding($name);
131               croak qq(encoding "$name" not found) unless ref $obj;
132               $obj->decode($bytes);
133           };
134
135       with more error checking.
136
137       You can therefore save time by reusing this object as follows:
138
139           my $enc = find_encoding("iso-8859-1");
140           while(<>) {
141               my $utf8 = $enc->decode($_);
142               ... # now do something with $utf8;
143           }
144
145       Besides "decode" and "encode", other methods are available as well.
146       For instance, "name()" returns the canonical name of the encoding
147       object.
148
149         find_encoding("latin1")->name; # iso-8859-1
150
151       See Encode::Encoding for details.
152
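       Since "find_encoding" returns "undef" for an unknown name, it is
       worth checking the result before calling methods on it.  A short
       sketch (the second name is deliberately made up):

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

for my $name ('latin1', 'no-such-encoding') {
    my $enc = find_encoding($name);
    if (ref $enc) {
        # name() gives the canonical name, whatever alias was used.
        printf "%s resolves to %s\n", $name, $enc->name;
    }
    else {
        print "$name: no such encoding\n";
    }
}
```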
153       from_to
154
155         [$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])
156
157       Converts in-place data between two encodings. The data in $octets must
158       be encoded as octets and not as characters in Perl's internal format.
159       For example, to convert ISO-8859-1 data into Microsoft's CP1250
160       encoding:
161
162         from_to($octets, "iso-8859-1", "cp1250");
163
164       and to convert it back:
165
166         from_to($octets, "cp1250", "iso-8859-1");
167
168       Because the conversion happens in place, the data to be converted
169       cannot be a string constant: it must be a scalar variable.
170
171       "from_to()" returns the length of the converted string in octets on
172       success, and "undef" on error.
173
174       CAVEAT: The following operations may look the same, but are not:
175
176         from_to($data, "iso-8859-1", "utf8"); #1
177         $data = decode("iso-8859-1", $data);  #2
178
179       Both #1 and #2 make $data consist of a completely valid UTF-8 string,
180       but only #2 turns the UTF8 flag on.  #1 is equivalent to:
181
182         $data = encode("utf8", decode("iso-8859-1", $data));
183
184       See "The UTF8 flag" below.
185
186       Also note that:
187
188         from_to($octets, $from, $to, $check);
189
190       is equivalent to:
191
192         $octets = encode($to, decode($from, $octets), $check);
193
194       Yes, it does not respect the $check during decoding.  It is
195       deliberately done that way.  If you need minute control, use "decode"
196       followed by "encode" as follows:
197
198         $octets = encode($to, decode($from, $octets, $check_from), $check_to);
199
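       The difference between operations #1 and #2 above can be observed
       with "is_utf8", which is described later in this document.  The
       sample data here is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to decode is_utf8);

my $octets = "caf\xE9";    # four ISO-8859-1 octets; the last is e-acute
my $copy   = $octets;

# 1: in-place conversion; the result is still a sequence of octets.
my $length = from_to($octets, 'iso-8859-1', 'utf8');

# 2: decoding; the result is a character string.
my $string = decode('iso-8859-1', $copy);

printf "from_to: %d octets, UTF8 flag %s\n",
    $length, is_utf8($octets) ? 'on' : 'off';
printf "decode:  %d characters, UTF8 flag %s\n",
    length($string), is_utf8($string) ? 'on' : 'off';
```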
200       encode_utf8
201
202         $octets = encode_utf8($string);
203
204       Equivalent to "$octets = encode("utf8", $string)".  The characters in
205       $string are encoded in Perl's internal format, and the result is
206       returned as a sequence of octets.  Because all possible characters in
207       Perl have a (loose, not strict) UTF-8 representation, this function
208       cannot fail.
209
210       decode_utf8
211
212         $string = decode_utf8($octets [, CHECK]);
213
214       Equivalent to "$string = decode("utf8", $octets [, CHECK])".  The
215       sequence of octets represented by $octets is decoded from UTF-8 into a
216       sequence of logical characters.  Because not all sequences of octets
217       are valid UTF-8, it is quite possible for this function to fail.  For
218       CHECK, see "Handling Malformed Data".
219
220   Listing available encodings
221         use Encode;
222         @list = Encode->encodings();
223
224       Returns a list of canonical names of available encodings that have
225       already been loaded.  To get a list of all available encodings
226       including those that have not yet been loaded, say:
227
228         @all_encodings = Encode->encodings(":all");
229
230       Or you can give the name of a specific module:
231
232         @with_jp = Encode->encodings("Encode::JP");
233
234       When ""::"" is not in the name, ""Encode::"" is assumed.
235
236         @ebcdic = Encode->encodings("EBCDIC");
237
238       To find out in detail which encodings are supported by this package,
239       see Encode::Supported.
240
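       As a rough sketch of how these lists might be used, the following
       compares the loaded set with the full set and filters the latter by
       name.  The exact counts depend on the installed "Encode" version, so
       none are shown.

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();          # already loaded
my @all    = Encode->encodings(':all');    # everything Encode knows about

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;

# Pick out one family of encodings by canonical name.
my @iso8859 = grep { /^iso-8859-/ } @all;
print "  $_\n" for @iso8859;
```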
241   Defining Aliases
242       To add a new alias to a given encoding, use:
243
244         use Encode;
245         use Encode::Alias;
246         define_alias(NEWNAME => ENCODING);
247
248       After that, NEWNAME can be used as an alias for ENCODING.  ENCODING may
249       be either the name of an encoding or an encoding object.
250
251       Before you do that, first make sure the alias is nonexistent using
252       "resolve_alias()", which returns the canonical name thereof.  For
253       example:
254
255         Encode::resolve_alias("latin1") eq "iso-8859-1" # true
256         Encode::resolve_alias("iso-8859-12")   # false; nonexistent
257         Encode::resolve_alias($name) eq $name  # true if $name is canonical
258
259       "resolve_alias()" does not need "use Encode::Alias"; it can be imported
260       via "use Encode qw(resolve_alias)".
261
262       See Encode::Alias for details.
263
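       Putting the two steps together, a guarded alias definition might look
       like this.  The alias name "my-latin" is hypothetical and chosen only
       because it is unlikely to exist already.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);
use Encode::Alias;    # exports define_alias

my $alias = 'my-latin';    # hypothetical alias name

# Define the alias only if nothing answers to that name yet.
define_alias($alias => 'iso-8859-1') unless resolve_alias($alias);

print "$alias resolves to ", resolve_alias($alias), "\n";

# The alias can now be used wherever an encoding name is expected.
my $octets = encode($alias, "\x{E9}");
printf "encoded to %d octet(s)\n", length $octets;
```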
264   Finding IANA Character Set Registry names
265       The canonical name of a given encoding does not necessarily agree with
266       the IANA Character Set Registry, commonly seen as "Content-Type:
267       text/plain; charset=WHATEVER".  For most cases, the canonical name
268       works, but sometimes it does not, most notably with "utf-8-strict".
269
270       As of "Encode" version 2.21, a new method "mime_name()" is therefore
271       added.
272
273         use Encode;
274         my $enc = find_encoding("UTF-8");
275         warn $enc->name;      # utf-8-strict
276         warn $enc->mime_name; # UTF-8
277
278       See also:  Encode::Encoding
279

Encoding via PerlIO

281       If your perl supports "PerlIO" (which is the default), you can use a
282       "PerlIO" layer to decode and encode directly via a filehandle.  The
283       following two examples are fully identical in functionality:
284
285         ### Version 1 via PerlIO
286           open(INPUT,  "< :encoding(shiftjis)", $infile)
287               || die "Can't open < $infile for reading: $!";
288           open(OUTPUT, "> :encoding(euc-jp)",  $outfile)
289               || die "Can't open > $outfile for writing: $!";
290           while (<INPUT>) {   # auto decodes $_
291               print OUTPUT;   # auto encodes $_
292           }
293           close(INPUT)   || die "can't close $infile: $!";
294           close(OUTPUT)  || die "can't close $outfile: $!";
295
296         ### Version 2 via from_to()
297           open(INPUT,  "< :raw", $infile)
298               || die "Can't open < $infile for reading: $!";
299           open(OUTPUT, "> :raw",  $outfile)
300               || die "Can't open > $outfile for writing: $!";
301
302           while (<INPUT>) {
303               from_to($_, "shiftjis", "euc-jp", 1);  # switch encoding
304               print OUTPUT;   # emit raw (but properly encoded) data
305           }
306           close(INPUT)   || die "can't close $infile: $!";
307           close(OUTPUT)  || die "can't close $outfile: $!";
308
309       In the first version above, you let the appropriate encoding layer
310       handle the conversion.  In the second, you explicitly translate from
311       one encoding to the other.
312
313       Unfortunately, not all encodings are "PerlIO"-savvy.  You can
314       check to see whether your encoding is supported by "PerlIO" by invoking
315       the "perlio_ok" method on it:
316
317         Encode::perlio_ok("hz");             # false
318         find_encoding("euc-cn")->perlio_ok;  # true wherever PerlIO is available
319
320         use Encode qw(perlio_ok);            # imported upon request
321         perlio_ok("euc-jp")
322
323       Fortunately, all encodings that come with "Encode" core are
324       "PerlIO"-savvy except for "hz" and "ISO-2022-kr".  For the gory
325       details, see Encode::Encoding and Encode::PerlIO.
326
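       The effect of an encoding layer can also be seen without touching
       the disk by opening an in-memory filehandle.  This is only a sketch;
       it assumes a perl built with "PerlIO", and the sample text is made up.

```perl
use strict;
use warnings;

my $buffer = '';    # receives the encoded octets

# Writing through the layer encodes characters into UTF-8 octets.
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Can't open in-memory file for writing: $!";
print {$out} "price: \x{20AC}";
close($out) or die "can't close: $!";

# Reading through the layer decodes the octets back into characters.
open(my $in, '<:encoding(UTF-8)', \$buffer)
    or die "Can't open in-memory file for reading: $!";
my $text = <$in>;
close($in) or die "can't close: $!";

printf "%d octets stored, %d characters read\n",
    length($buffer), length($text);
```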

Handling Malformed Data

328       The optional CHECK argument tells "Encode" what to do when encountering
329       malformed data.  Without CHECK, "Encode::FB_DEFAULT" (== 0) is assumed.
330
331       As of version 2.12, "Encode" supports coderef values for "CHECK"; see
332       below.
333
334       NOTE: Not all encodings support this feature.  Some encodings ignore
335       the CHECK argument.  For example, Encode::Unicode ignores CHECK and it
336       always croaks on error.
337
338   List of CHECK values
339       FB_DEFAULT
340
341         CHECK = Encode::FB_DEFAULT ( == 0)
342
343       If CHECK is 0, encoding and decoding replace any malformed character
344       with a substitution character.  When you encode, SUBCHAR is used.  When
345       you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is
346       used.  If the data is supposed to be UTF-8, an optional lexical warning
347       of warning category "utf8" is given.
348
349       FB_CROAK
350
351         CHECK = Encode::FB_CROAK ( == 1)
352
353       If CHECK is 1, methods immediately die with an error message.
354       Therefore, when CHECK is 1, you should trap exceptions with "eval{}",
355       unless you really want to let it "die".
356
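       Trapping the exception with "eval{}" might look like the following
       sketch.  The input is a made-up example of malformed UTF-8, and
       "LEAVE_SRC" (described below) is added so that the source is left
       untouched.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "abc\xFF";    # 0xFF can never appear in UTF-8

my $string = eval {
    decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
};

if (defined $string) {
    print "decoded: $string\n";
}
else {
    # $@ carries the message that decode() died with.
    print "decode failed: $@";
}
```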
357       FB_QUIET
358
359         CHECK = Encode::FB_QUIET
360
361       If CHECK is set to "Encode::FB_QUIET", encoding and decoding
362       immediately return the portion of the data that has been processed so
363       far when an error occurs. The data argument is overwritten with
364       everything after that point; that is, the unprocessed portion of the
365       data.  This is handy when you have to call "decode" repeatedly in the
366       case where your source data may contain partial multi-byte character
367       sequences (that is, you are reading with a fixed-width buffer).  Here's
368       some sample code to do exactly that:
369
370           my($buffer, $string) = ("", "");
371           while (read($fh, $buffer, 256, length($buffer))) {
372               $string .= decode($encoding, $buffer, Encode::FB_QUIET);
373               # $buffer now contains the unprocessed partial character
374           }
375
376       FB_WARN
377
378         CHECK = Encode::FB_WARN
379
380       This is the same as "FB_QUIET" above, except that instead of being
381       silent on errors, it issues a warning.  This is handy for when you are
382       debugging.
383
384       FB_PERLQQ FB_HTMLCREF FB_XMLCREF
385
386       perlqq mode (CHECK = Encode::FB_PERLQQ)
387       HTML charref mode (CHECK = Encode::FB_HTMLCREF)
388       XML charref mode (CHECK = Encode::FB_XMLCREF)
389
390       For encodings that are implemented by the "Encode::XS" module, "CHECK"
391       "==" "Encode::FB_PERLQQ" puts "encode" and "decode" into "perlqq"
392       fallback mode.
393
394       When you decode, "\xHH" is inserted for a malformed character, where HH
395       is the hex representation of the octet that could not be decoded to
396       utf8.  When you encode, "\x{HHHH}" will be inserted, where HHHH is the
397       Unicode code point (in any number of hex digits) of the character that
398       cannot be found in the character repertoire of the encoding.
399
400       The HTML/XML character reference modes are about the same. In place of
401       "\x{HHHH}", HTML uses "&#NNN;" where NNN is a decimal number, and XML
402       uses "&#xHHHH;" where HHHH is the hexadecimal number.
403
404       In "Encode" 2.10 or later, "LEAVE_SRC" is also implied.
405
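       To compare the three modes side by side, the same string can be
       encoded to ASCII with each of them.  This is an illustrative sketch
       with a made-up sample string; the exact digit padding and letter case
       of the escapes are left to "Encode".

```perl
use strict;
use warnings;
use Encode qw(encode);

# "cafe" with an acute accent, a space, and the euro sign.
my $text = "caf\x{E9} \x{20AC}";

my $perlqq = encode('ascii', $text, Encode::FB_PERLQQ);
my $html   = encode('ascii', $text, Encode::FB_HTMLCREF);
my $xml    = encode('ascii', $text, Encode::FB_XMLCREF);

print "perlqq: $perlqq\n";    # \x{...} escapes
print "html:   $html\n";      # decimal character references
print "xml:    $xml\n";       # hexadecimal character references
```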
406       The bitmask
407
408       These modes are all actually set via a bitmask.  Here is how the
409       "FB_XXX" constants are laid out.  You can import the "FB_XXX" constants
410       via "use Encode qw(:fallbacks)", and you can import the generic bitmask
411       constants via "use Encode qw(:fallback_all)".
412
413                            FB_DEFAULT FB_CROAK FB_QUIET FB_WARN  FB_PERLQQ
414        DIE_ON_ERR    0x0001             X
415        WARN_ON_ERR   0x0002                               X
416        RETURN_ON_ERR 0x0004                      X        X
417        LEAVE_SRC     0x0008                                        X
418        PERLQQ        0x0100                                        X
419        HTMLCREF      0x0200
420        XMLCREF       0x0400
421
422       LEAVE_SRC
423
424         Encode::LEAVE_SRC
425
426       If the "Encode::LEAVE_SRC" bit is not set but CHECK is set, then the
427       source string to encode() or decode() will be overwritten in place.  If
428       you're not interested in this, then bitwise-OR it with the bitmask.
429
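       The overwriting behaviour is easy to miss, so here is a small sketch
       of both cases.  The variable names are assumptions for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# CHECK set without LEAVE_SRC: the source is overwritten with the
# unprocessed remainder, which is empty when everything was encoded.
my $consumed = "abc";
my $first    = encode('ascii', $consumed, Encode::FB_QUIET);

# CHECK bitwise-ORed with LEAVE_SRC: the source is preserved.
my $kept   = "abc";
my $second = encode('ascii', $kept, Encode::FB_QUIET | Encode::LEAVE_SRC);

printf "without LEAVE_SRC: result '%s', source '%s'\n", $first,  $consumed;
printf "with LEAVE_SRC:    result '%s', source '%s'\n", $second, $kept;
```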
430   coderef for CHECK
431       As of "Encode" 2.12, "CHECK" can also be a code reference which takes
432       the ordinal value of the unmapped character as an argument and returns
433       a string that represents the fallback character.  For instance:
434
435         $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });
436
437       Acts like "FB_PERLQQ" but "<U+XXXX>" is used instead of "\x{XXXX}".
438

Defining Encodings

440       To define a new encoding, use:
441
442           use Encode qw(define_encoding);
443           define_encoding($object, CANONICAL_NAME [, alias...]);
444
445       CANONICAL_NAME will be associated with $object.  The object should
446       provide the interface described in Encode::Encoding.  If more than two
447       arguments are provided, additional arguments are considered aliases for
448       $object.
449
450       See Encode::Encoding for details.
451
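       As a toy illustration only, the sketch below registers a ROT13
       "encoding".  The class name "My::ROT13" and the encoding name
       "x-rot13" are hypothetical, and the object is assumed to need nothing
       more than a "Name" entry plus "encode" and "decode" methods; consult
       Encode::Encoding for the full interface.

```perl
use strict;
use warnings;
use Encode qw(define_encoding encode decode);

{
    package My::ROT13;                  # hypothetical class name
    use parent qw(Encode::Encoding);    # supplies name() and friends

    sub encode {
        my ($self, $string, $check) = @_;
        $string =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        $_[1] = '' if $check;           # honour in-place semantics
        return $string;
    }

    # ROT13 is its own inverse, so decoding is the same operation.
    sub decode { my $self = shift; return $self->encode(@_) }
}

# Associate the canonical name with an object of that class.
define_encoding(bless({ Name => 'x-rot13' }, 'My::ROT13'), 'x-rot13');

my $scrambled = encode('x-rot13', 'Hello');
my $restored  = decode('x-rot13', $scrambled);
print "$scrambled -> $restored\n";
```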

The UTF8 flag

453       Before the introduction of Unicode support in Perl, the "eq" operator
454       just compared the strings represented by two scalars. Beginning with
455       Perl 5.8, "eq" compares two strings with simultaneous consideration of
456       the UTF8 flag. To explain why we made it so, I quote from page 402 of
457       Programming Perl, 3rd ed.
458
459       Goal #1:
460         Old byte-oriented programs should not spontaneously break on the old
461         byte-oriented data they used to work on.
462
463       Goal #2:
464         Old byte-oriented programs should magically start working on the new
465         character-oriented data when appropriate.
466
467       Goal #3:
468         Programs should run just as fast in the new character-oriented mode
469         as in the old byte-oriented mode.
470
471       Goal #4:
472         Perl should remain one language, rather than forking into a byte-
473         oriented Perl and a character-oriented Perl.
474
475       When Programming Perl, 3rd ed. was written, not even Perl 5.6.0 had
476       been born yet, so many features documented in the book remained
477       unimplemented for a long time.  Perl 5.8 corrected much of this, and
478       the introduction of the UTF8 flag is one of them.  You can think of
479       there being two fundamentally different kinds of strings and string-
480       operations in Perl: one a byte-oriented mode  for when the internal
481       UTF8 flag is off, and the other a character-oriented mode for when the
482       internal UTF8 flag is on.
483
484       Here is how "Encode" handles the UTF8 flag.
485
486       · When you encode, the resulting UTF8 flag is always off.
487
488       · When you decode, the resulting UTF8 flag is on--unless you can
489         unambiguously represent data.  Here is what we mean by
490         "unambiguously".  After "$utf8 = decode("foo", $octet)",
491
492           When $octet is...   The UTF8 flag in $utf8 is
493           ---------------------------------------------
494           In ASCII only (or EBCDIC only)            OFF
495           In ISO-8859-1                              ON
496           In any other Encoding                      ON
497           ---------------------------------------------
498
499         As you see, there is one exception: in ASCII.  That way you can
500         assume Goal #1.  And with "Encode", Goal #2 is assumed but you still
501         have to be careful in the cases mentioned in the CAVEAT paragraphs
502         above.
503
504         This UTF8 flag is not visible in Perl scripts, exactly for the same
505         reason you cannot (or rather, you don't have to) see whether a scalar
506         contains a string, an integer, or a floating-point number.   But you
507         can still peek and poke these if you will.  See the next section.
508
509   Messing with Perl's Internals
510       The following API uses parts of Perl's internals in the current
511       implementation.  As such, they are efficient but may change in a future
512       release.
513
514       is_utf8
515
516         is_utf8(STRING [, CHECK])
517
518       [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.  If
519       CHECK is true, also checks whether STRING contains well-formed UTF-8.
520       Returns true if successful, false otherwise.
521
522       As of Perl 5.8.1, utf8 also has the "utf8::is_utf8" function.
523
524       _utf8_on
525
526         _utf8_on(STRING)
527
528       [INTERNAL] Turns the STRING's internal UTF8 flag on.  The STRING is not
529       checked for containing only well-formed UTF-8.  Do not use this unless
530       you know with absolute certainty that the STRING holds only well-formed
531       UTF-8.  Returns the previous state of the UTF8 flag (so please don't
532       treat the return value as indicating success or failure), or "undef" if
533       STRING is not a string.
534
535       NOTE: For security reasons, this function does not work on tainted
536       values.
537
538       _utf8_off
539
540         _utf8_off(STRING)
541
542       [INTERNAL] Turns the STRING's internal UTF8 flag off.  Do not use
543       frivolously.  Returns the previous state of the UTF8 flag, or "undef"
544       if STRING is not a string.  Do not treat the return value as indicative
545       of success or failure, because that isn't what it means: it is only the
546       previous setting.
547
548       NOTE: For security reasons, this function does not work on tainted
549       values.
550

UTF-8 vs. utf8 vs. UTF8

552         ....We now view strings not as sequences of bytes, but as sequences
553         of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
554         computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
555
556       That has historically been Perl's notion of UTF-8, as that is how UTF-8
557       was first conceived by Ken Thompson when he invented it. However,
558       thanks to later revisions to the applicable standards, official UTF-8
559       is now rather stricter than that. For example, its range is much
560       narrower (0 .. 0x10_FFFF to cover only 21 bits instead of 32 or 64
561       bits) and some sequences are not allowed, like those used in surrogate
562       pairs, the 31 non-character code points 0xFDD0 .. 0xFDEF, the last two
563       code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non-shortest
564       encodings, etc.
565
566       The former default in which Perl would always use a loose
567       interpretation of UTF-8 has now been overruled:
568
569         From: Larry Wall <larry@wall.org>
570         Date: December 04, 2004 11:51:58 JST
571         To: perl-unicode@perl.org
572         Subject: Re: Make Encode.pm support the real UTF-8
573         Message-Id: <20041204025158.GA28754@wall.org>
574
575         On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
576         : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
577         : but "UTF-8" is the name of the standard and should give the
578         : corresponding behaviour.
579
580         For what it's worth, that's how I've always kept them straight in my
581         head.
582
583         Also for what it's worth, Perl 6 will mostly default to strict but
584         make it easy to switch back to lax.
585
586         Larry
587
588       Got that?  As of Perl 5.8.7, "UTF-8" means UTF-8 in its current sense,
589       which is conservative and strict and security-conscious, whereas "utf8"
590       means UTF-8 in its former sense, which was liberal and loose and lax.
591       "Encode" version 2.10 or later thus groks this subtle but critically
592       important distinction between "UTF-8" and "utf8".
593
594         encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
595         encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
596
597       In the "Encode" module, "UTF-8" is actually an alias that resolves to
598       "utf-8-strict".  That hyphen between the "UTF" and the "8" is critical;
599       without it, "Encode" goes "liberal" and (perhaps overly-)permissive:
600
601         find_encoding("UTF-8")->name # is 'utf-8-strict'
602         find_encoding("utf-8")->name # ditto. names are case insensitive
603         find_encoding("utf_8")->name # ditto. "_" are treated as "-"
604         find_encoding("UTF8")->name  # is 'utf8'.
605
606       Perl's internal UTF8 flag is called "UTF8", without a hyphen. It
607       indicates whether a string is internally encoded as "utf8", also
608       without a hyphen.
609
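       The strict/lax distinction can be checked with a code point just
       beyond the Unicode range.  This is a sketch under the assumption that
       U+110000 is representative of what strict "UTF-8" rejects; "LEAVE_SRC"
       keeps the test string intact between calls.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $beyond = chr(0x110000);    # one past the Unicode maximum, U+10FFFF
my $check  = Encode::FB_CROAK | Encode::LEAVE_SRC;

# Lax "utf8" accepts the code point.
my $lax = encode('utf8', $beyond, $check);
printf "utf8:  encoded to %d octets\n", length $lax;

# Strict "UTF-8" refuses it, so the call dies under FB_CROAK.
my $strict = eval { encode('UTF-8', $beyond, $check) };
print "UTF-8: ", (defined $strict ? "encoded" : "croaked"), "\n";

printf "%s vs %s\n",
    find_encoding('UTF-8')->name, find_encoding('utf8')->name;
```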

MAINTAINER

617       This project was originated by the late Nick Ing-Simmons and later
618       maintained by Dan Kogai <dankogai@cpan.org>.  See AUTHORS for a full
619       list of people involved.  For any questions, send mail to
620       <perl-unicode@perl.org> so that we can all share.
621
622       While Dan Kogai retains the copyright as a maintainer, credit should go
623       to all those involved.  See AUTHORS for a list of those who submitted
624       code to the project.
625

COPYRIGHT

627       Copyright 2002-2013 Dan Kogai <dankogai@cpan.org>.
628
629       This library is free software; you can redistribute it and/or modify it
630       under the same terms as Perl itself.
631
632
633
634perl v5.16.3                      2014-06-10                         Encode(3)