Encode(3)             User Contributed Perl Documentation            Encode(3)

NAME

       Encode - character encodings in Perl

SYNOPSIS

           use Encode qw(decode encode);
           $characters = decode('UTF-8', $octets,     Encode::FB_CROAK);
           $octets     = encode('UTF-8', $characters, Encode::FB_CROAK);

13   Table of Contents
14       Encode consists of a collection of modules whose details are too
15       extensive to fit in one document.  This one itself explains the top-
16       level APIs and general topics at a glance.  For other topics and more
17       details, see the documentation for these modules:
18
19       Encode::Alias - Alias definitions to encodings
20       Encode::Encoding - Encode Implementation Base Class
21       Encode::Supported - List of Supported Encodings
22       Encode::CN - Simplified Chinese Encodings
23       Encode::JP - Japanese Encodings
24       Encode::KR - Korean Encodings
25       Encode::TW - Traditional Chinese Encodings
26

DESCRIPTION

28       The "Encode" module provides the interface between Perl strings and the
29       rest of the system.  Perl strings are sequences of characters.
30
       The repertoire of characters that Perl can represent is a superset of
       those defined by the Unicode Consortium. On most platforms the ordinal
       value of a character as returned by ord(S) is the Unicode code point
       for that character. The exceptions are platforms where the legacy
       encoding is some variant of EBCDIC rather than a superset of ASCII; see
       perlebcdic.
37
       In recent history, data has been moved around a computer in 8-bit
       chunks, often called "bytes" but also known as "octets" in standards
       documents.  Perl is widely used to manipulate data of many types: not
       only strings of characters representing human or computer languages,
       but also "binary" data, being the machine's representation of numbers,
       pixels in an image, or just about anything.
44
45       When Perl is processing "binary data", the programmer wants Perl to
46       process "sequences of bytes". This is not a problem for Perl: because a
47       byte has 256 possible values, it easily fits in Perl's much larger
48       "logical character".
49
50       This document mostly explains the how. perlunitut and perlunifaq
51       explain the why.
52
53   TERMINOLOGY
54       character
55
56       A character in the range 0 .. 2**32-1 (or more); what Perl's strings
57       are made of.
58
59       byte
60
61       A character in the range 0..255; a special case of a Perl character.
62
63       octet
64
65       8 bits of data, with ordinal values 0..255; term for bytes passed to or
66       from a non-Perl context, such as a disk file, standard I/O stream,
67       database, command-line argument, environment variable, socket etc.
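
       As a minimal sketch of the distinction between characters and octets
       (the sample character U+00E9 is chosen only for illustration), the
       same single character occupies a different number of octets depending
       on the encoding:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+00E9 (LATIN SMALL LETTER E WITH ACUTE).
my $characters = "\x{E9}";

# Encoding yields octets; how many depends on the encoding.
my $utf8_octets   = encode('UTF-8',      $characters);
my $latin1_octets = encode('iso-8859-1', $characters);

printf "characters:     %d\n", length $characters;      # 1
printf "UTF-8 octets:   %d\n", length $utf8_octets;     # 2
printf "Latin-1 octets: %d\n", length $latin1_octets;   # 1
```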
68

THE PERL ENCODING API

70   Basic methods
71       encode
72
73         $octets  = encode(ENCODING, STRING[, CHECK])
74
75       Encodes the scalar value STRING from Perl's internal form into ENCODING
76       and returns a sequence of octets.  ENCODING can be either a canonical
77       name or an alias.  For encoding names and aliases, see "Defining
78       Aliases".  For CHECK, see "Handling Malformed Data".
79
80       CAVEAT: the input scalar STRING might be modified in-place depending on
81       what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be
82       left unchanged.
83
84       For example, to convert a string from Perl's internal format into
85       ISO-8859-1, also known as Latin1:
86
87         $octets = encode("iso-8859-1", $string);
88
89       CAVEAT: When you run "$octets = encode("UTF-8", $string)", then $octets
90       might not be equal to $string.  Though both contain the same data, the
91       UTF8 flag for $octets is always off.  When you encode anything, the
92       UTF8 flag on the result is always off, even when it contains a
93       completely valid UTF-8 string. See "The UTF8 flag" below.
94
95       If the $string is "undef", then "undef" is returned.
96
97       "str2bytes" may be used as an alias for "encode".
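
       The points above can be sketched as follows.  The sample string is
       made up for illustration, and LEAVE_SRC is added so that the input is
       not consumed even though a CHECK value is given:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "caf\x{E9}";    # four characters (sample data)

# FB_CROAK dies on unencodable input; LEAVE_SRC keeps $string intact.
my $octets = encode('UTF-8', $string, Encode::FB_CROAK | Encode::LEAVE_SRC);

print length($string), "\n";    # 4 characters
print length($octets), "\n";    # 5 octets: "caf" plus 0xC3 0xA9

# An undefined input yields undef rather than an error.
my $nothing = encode('UTF-8', undef);
print defined($nothing) ? "defined\n" : "undef\n";    # undef
```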
98
99       decode
100
101         $string = decode(ENCODING, OCTETS[, CHECK])
102
103       This function returns the string that results from decoding the scalar
104       value OCTETS, assumed to be a sequence of octets in ENCODING, into
105       Perl's internal form.  As with encode(), ENCODING can be either a
106       canonical name or an alias. For encoding names and aliases, see
107       "Defining Aliases"; for CHECK, see "Handling Malformed Data".
108
109       CAVEAT: the input scalar OCTETS might be modified in-place depending on
110       what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be
111       left unchanged.
112
113       For example, to convert ISO-8859-1 data into a string in Perl's
114       internal format:
115
116         $string = decode("iso-8859-1", $octets);
117
118       CAVEAT: When you run "$string = decode("UTF-8", $octets)", then $string
119       might not be equal to $octets.  Though both contain the same data, the
120       UTF8 flag for $string is on.  See "The UTF8 flag" below.
121
       If $octets is "undef", then "undef" is returned.
123
124       "bytes2str" may be used as an alias for "decode".
125
126       find_encoding
127
128         [$obj =] find_encoding(ENCODING)
129
       Returns the encoding object corresponding to ENCODING.  Returns "undef"
       if no matching ENCODING is found.  The returned object is what does the
       actual encoding or decoding.
133
134         $string = decode($name, $bytes);
135
136       is in fact
137
138           $string = do {
139               $obj = find_encoding($name);
140               croak qq(encoding "$name" not found) unless ref $obj;
141               $obj->decode($bytes);
142           };
143
144       with more error checking.
145
       You can therefore save time by reusing this object as follows:

           my $enc = find_encoding("iso-8859-1");
           while(<>) {
               my $string = $enc->decode($_);
               ... # now do something with $string;
           }
153
154       Besides "decode" and "encode", other methods are available as well.
155       For instance, name() returns the canonical name of the encoding object.
156
157         find_encoding("latin1")->name; # iso-8859-1
158
159       See Encode::Encoding for details.
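
       A self-contained variant of the reuse pattern above might look like
       this.  The record data is invented for the sketch; only
       find_encoding(), decode() and name() come from the text above:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $name = 'latin1';    # an alias; the canonical name is iso-8859-1
my $enc  = find_encoding($name)
    or die qq(encoding "$name" not found);

# Sample ISO-8859-1 octets, made up for this sketch.
my @records = ("caf\xE9", "na\xEFve");

# Look the encoding up once, then reuse the object for every record.
my @strings = map { $enc->decode($_) } @records;

print $enc->name, "\n";    # iso-8859-1
print scalar(@strings), " records decoded\n";

# An unknown name yields undef rather than an exception.
print "no such encoding\n" unless defined find_encoding('no-such-encoding');
```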
160
161       find_mime_encoding
162
163         [$obj =] find_mime_encoding(MIME_ENCODING)
164
       Returns the encoding object corresponding to MIME_ENCODING.  Acts the
       same as find_encoding(), except that the mime_name() of the returned
       object must match MIME_ENCODING.  Unlike find_encoding(), canonical
       names and aliases are not used when searching for the object.

           find_mime_encoding("utf8");         # returns undef because "utf8" is not a valid MIME_ENCODING
           find_mime_encoding("utf-8");        # returns the encoding object "utf-8-strict"
           find_mime_encoding("UTF-8");        # same as "utf-8" because MIME_ENCODING is case insensitive
           find_mime_encoding("utf-8-strict"); # returns undef because "utf-8-strict" is not a valid MIME_ENCODING
174
175       from_to
176
177         [$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])
178
179       Converts in-place data between two encodings. The data in $octets must
180       be encoded as octets and not as characters in Perl's internal format.
181       For example, to convert ISO-8859-1 data into Microsoft's CP1250
182       encoding:
183
184         from_to($octets, "iso-8859-1", "cp1250");
185
186       and to convert it back:
187
188         from_to($octets, "cp1250", "iso-8859-1");
189
190       Because the conversion happens in place, the data to be converted
191       cannot be a string constant: it must be a scalar variable.
192
193       from_to() returns the length of the converted string in octets on
194       success, and "undef" on error.
195
196       CAVEAT: The following operations may look the same, but are not:
197
198         from_to($data, "iso-8859-1", "UTF-8"); #1
199         $data = decode("iso-8859-1", $data);  #2
200
201       Both #1 and #2 make $data consist of a completely valid UTF-8 string,
202       but only #2 turns the UTF8 flag on.  #1 is equivalent to:
203
204         $data = encode("UTF-8", decode("iso-8859-1", $data));
205
206       See "The UTF8 flag" below.
207
208       Also note that:
209
210         from_to($octets, $from, $to, $check);
211
212       is equivalent to:
213
214         $octets = encode($to, decode($from, $octets), $check);
215
       Note that from_to() does not apply $check during decoding; this is
       deliberate.  If you need finer control, use "decode" followed by
       "encode" as follows:
219
220         $octets = encode($to, decode($from, $octets, $check_from), $check_to);
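
       A short sketch of the in-place conversion and its return value, using
       invented sample data.  Note that the result is a sequence of octets,
       so the UTF8 flag stays off:

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $data = "caf\xE9";    # four ISO-8859-1 octets (sample data)

# from_to() needs a variable, not a constant: it converts in place.
my $length = from_to($data, 'iso-8859-1', 'UTF-8');
defined $length or die "conversion failed";

print "$length\n";                                          # 5
print utf8::is_utf8($data) ? "flag on\n" : "flag off\n";    # flag off
```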
221
222       encode_utf8
223
224         $octets = encode_utf8($string);
225
226       WARNING: This function can produce invalid UTF-8!  Do not use it for
227       data exchange.  Unless you want Perl's older "lax" mode, prefer
228       "$octets = encode("UTF-8", $string)".
229
230       Equivalent to "$octets = encode("utf8", $string)".  The characters in
231       $string are encoded in Perl's internal format, and the result is
232       returned as a sequence of octets.  Because all possible characters in
233       Perl have a (loose, not strict) utf8 representation, this function
234       cannot fail.
235
236       decode_utf8
237
238         $string = decode_utf8($octets [, CHECK]);
239
240       WARNING: This function accepts invalid UTF-8!  Do not use it for data
241       exchange.  Unless you want Perl's older "lax" mode, prefer "$string =
242       decode("UTF-8", $octets [, CHECK])".
243
       Equivalent to "$string = decode("utf8", $octets [, CHECK])".  The
       sequence of octets represented by $octets is decoded from (loose, not
       strict) utf8 into a sequence of logical characters.  Because not all
       sequences of octets are valid utf8, even in its loose form, it is
       quite possible for this function to fail.  For CHECK, see "Handling
       Malformed Data".
249
250       CAVEAT: the input $octets might be modified in-place depending on what
251       is set in CHECK. See "LEAVE_SRC" if you want your inputs to be left
252       unchanged.
253
254   Listing available encodings
255         use Encode;
256         @list = Encode->encodings();
257
258       Returns a list of canonical names of available encodings that have
259       already been loaded.  To get a list of all available encodings
260       including those that have not yet been loaded, say:
261
262         @all_encodings = Encode->encodings(":all");
263
264       Or you can give the name of a specific module:
265
266         @with_jp = Encode->encodings("Encode::JP");
267
268       When ""::"" is not in the name, ""Encode::"" is assumed.
269
270         @ebcdic = Encode->encodings("EBCDIC");
271
272       To find out in detail which encodings are supported by this package,
273       see Encode::Supported.
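
       As a rough sketch, the two forms can be compared, and the full list
       can be turned into a lookup table.  The hash name %available is only
       an illustration:

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();          # only what is loaded so far
my @all    = Encode->encodings(':all');    # everything Encode knows about

printf "%d loaded, %d available\n", scalar @loaded, scalar @all;

# Check for a specific encoding by canonical name.
my %available = map { $_ => 1 } @all;
print "iso-8859-1 is available\n" if $available{'iso-8859-1'};
```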
274
275   Defining Aliases
276       To add a new alias to a given encoding, use:
277
278         use Encode;
279         use Encode::Alias;
280         define_alias(NEWNAME => ENCODING);
281
282       After that, NEWNAME can be used as an alias for ENCODING.  ENCODING may
283       be either the name of an encoding or an encoding object.
284
285       Before you do that, first make sure the alias is nonexistent using
286       resolve_alias(), which returns the canonical name thereof.  For
287       example:
288
289         Encode::resolve_alias("latin1") eq "iso-8859-1" # true
290         Encode::resolve_alias("iso-8859-12")   # false; nonexistent
291         Encode::resolve_alias($name) eq $name  # true if $name is canonical
292
293       resolve_alias() does not need "use Encode::Alias"; it can be imported
294       via "use Encode qw(resolve_alias)".
295
296       See Encode::Alias for details.
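
       Put together, the check-then-define sequence might be sketched like
       this.  The alias name "my-latin" is hypothetical and chosen only
       because it is unlikely to exist already:

```perl
use strict;
use warnings;
use Encode qw(resolve_alias);
use Encode::Alias;    # exports define_alias

my $alias = 'my-latin';    # hypothetical alias name, for illustration only

# Define the alias only if the name does not already resolve.
define_alias($alias => 'iso-8859-1') unless resolve_alias($alias);

print resolve_alias($alias), "\n";    # iso-8859-1
```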
297
298   Finding IANA Character Set Registry names
299       The canonical name of a given encoding does not necessarily agree with
300       IANA Character Set Registry, commonly seen as "Content-Type:
301       text/plain; charset=WHATEVER".  For most cases, the canonical name
302       works, but sometimes it does not, most notably with "utf-8-strict".
303
304       As of "Encode" version 2.21, a new method mime_name() is therefore
305       added.
306
307         use Encode;
308         my $enc = find_encoding("UTF-8");
309         warn $enc->name;      # utf-8-strict
310         warn $enc->mime_name; # UTF-8
311
312       See also:  Encode::Encoding
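
       One plausible use, sketched here under the assumption that a charset
       name arrives from outside and must be echoed back in registry form,
       is to use name() inside Perl and mime_name() when emitting a header:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $charset = 'UTF-8';    # as it might appear in a Content-Type header
my $enc     = find_encoding($charset)
    or die qq(unknown charset "$charset");

# name() is Encode's canonical name; mime_name() is the registry name.
printf "canonical: %s\n", $enc->name;                                # utf-8-strict
printf "Content-Type: text/plain; charset=%s\n", $enc->mime_name;    # UTF-8
```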
313

Encoding via PerlIO

315       If your perl supports "PerlIO" (which is the default), you can use a
316       "PerlIO" layer to decode and encode directly via a filehandle.  The
317       following two examples are fully identical in functionality:
318
         ### Version 1 via PerlIO
           open(INPUT,  "< :encoding(shiftjis)", $infile)
               || die "Can't open < $infile for reading: $!";
           open(OUTPUT, "> :encoding(euc-jp)",  $outfile)
               || die "Can't open > $outfile for writing: $!";
           while (<INPUT>) {   # auto decodes $_
               print OUTPUT;   # auto encodes $_
           }
           close(INPUT)   || die "can't close $infile: $!";
           close(OUTPUT)  || die "can't close $outfile: $!";

         ### Version 2 via from_to()
           use Encode qw(from_to);
           open(INPUT,  "< :raw", $infile)
               || die "Can't open < $infile for reading: $!";
           open(OUTPUT, "> :raw",  $outfile)
               || die "Can't open > $outfile for writing: $!";

           while (<INPUT>) {
               from_to($_, "shiftjis", "euc-jp", 1);  # switch encoding
               print OUTPUT;   # emit raw (but properly encoded) data
           }
           close(INPUT)   || die "can't close $infile: $!";
           close(OUTPUT)  || die "can't close $outfile: $!";
342
343       In the first version above, you let the appropriate encoding layer
344       handle the conversion.  In the second, you explicitly translate from
345       one encoding to the other.
346
       Unfortunately, not every encoding is "PerlIO"-savvy.  You can check
       whether your encoding is supported by "PerlIO" by invoking the
       "perlio_ok" method on it:
350
351         Encode::perlio_ok("hz");             # false
352         find_encoding("euc-cn")->perlio_ok;  # true wherever PerlIO is available
353
354         use Encode qw(perlio_ok);            # imported upon request
355         perlio_ok("euc-jp")
356
357       Fortunately, all encodings that come with "Encode" core are
358       "PerlIO"-savvy except for "hz" and "ISO-2022-kr".  For the gory
359       details, see Encode::Encoding and Encode::PerlIO.
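
       To try the layer approach without touching the file system, the
       following sketch uses in-memory filehandles and lexical handles
       instead of the bareword handles above.  The sample data and variable
       names are made up:

```perl
use strict;
use warnings;

my $input = "caf\xC3\xA9\n";    # UTF-8 octets (sample data)

# Read: the :encoding layer decodes octets into characters.
open(my $in, '<:encoding(UTF-8)', \$input)
    or die "Can't open in-memory input: $!";
my $line = <$in>;
close($in) or die "can't close input: $!";

# Write: the :encoding layer encodes characters into octets.
my $output = '';
open(my $out, '>:encoding(iso-8859-1)', \$output)
    or die "Can't open in-memory output: $!";
print {$out} $line;
close($out) or die "can't close output: $!";

printf "%d characters read, %d octets written\n",
    length($line), length($output);    # 5 and 5
```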
360

Handling Malformed Data

362       The optional CHECK argument tells "Encode" what to do when encountering
363       malformed data.  Without CHECK, "Encode::FB_DEFAULT" (== 0) is assumed.
364
365       As of version 2.12, "Encode" supports coderef values for "CHECK"; see
366       below.
367
368       NOTE: Not all encodings support this feature.  Some encodings ignore
369       the CHECK argument.  For example, Encode::Unicode ignores CHECK and it
370       always croaks on error.
371
372   List of CHECK values
373       FB_DEFAULT
374
         CHECK = Encode::FB_DEFAULT ( == 0)
376
377       If CHECK is 0, encoding and decoding replace any malformed character
378       with a substitution character.  When you encode, SUBCHAR is used.  When
379       you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is
380       used.  If the data is supposed to be UTF-8, an optional lexical warning
381       of warning category "utf8" is given.
382
383       FB_CROAK
384
         CHECK = Encode::FB_CROAK ( == 1)
386
387       If CHECK is 1, methods immediately die with an error message.
388       Therefore, when CHECK is 1, you should trap exceptions with "eval{}",
389       unless you really want to let it "die".
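
       A minimal sketch of trapping the exception, assuming UTF-8 input that
       contains one invalid octet.  LEAVE_SRC is added so the input survives
       for inspection:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "abc\xFFdef";    # 0xFF can never appear in UTF-8

# eval returns undef if decode() dies; the reason is left in $@.
my $string = eval {
    decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
};

if (defined $string) {
    print "decoded ", length($string), " characters\n";
}
else {
    print "decode failed: $@";
}
```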
390
391       FB_QUIET
392
         CHECK = Encode::FB_QUIET
394
395       If CHECK is set to "Encode::FB_QUIET", encoding and decoding
396       immediately return the portion of the data that has been processed so
397       far when an error occurs. The data argument is overwritten with
398       everything after that point; that is, the unprocessed portion of the
399       data.  This is handy when you have to call "decode" repeatedly in the
400       case where your source data may contain partial multi-byte character
401       sequences, (that is, you are reading with a fixed-width buffer). Here's
402       some sample code to do exactly that:
403
404           my($buffer, $string) = ("", "");
405           while (read($fh, $buffer, 256, length($buffer))) {
406               $string .= decode($encoding, $buffer, Encode::FB_QUIET);
407               # $buffer now contains the unprocessed partial character
408           }
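
       The same loop can be exercised end to end with an in-memory
       filehandle.  In this sketch the data and the deliberately small chunk
       size of 4 are arbitrary, picked so that reads split characters in the
       middle:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three characters of three octets each (sample data).
my $octets = encode('UTF-8', "\x{3042}\x{3044}\x{3046}");
open(my $fh, '<:raw', \$octets) or die "Can't open in-memory file: $!";

my ($buffer, $string) = ('', '');
while (read($fh, $buffer, 4, length $buffer)) {
    # Decodes what it can; the incomplete tail stays in $buffer.
    $string .= decode('UTF-8', $buffer, Encode::FB_QUIET);
}
close($fh) or die "can't close: $!";

printf "%d characters, %d octets left over\n",
    length($string), length($buffer);    # 3 and 0
```

       If $buffer is not empty after the loop, the input ended with an
       incomplete or malformed sequence.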
409
410       FB_WARN
411
         CHECK = Encode::FB_WARN
413
414       This is the same as "FB_QUIET" above, except that instead of being
415       silent on errors, it issues a warning.  This is handy for when you are
416       debugging.
417
       CAVEAT: All warnings from the Encode module are reported regardless of
       the "warnings" pragma settings.  If you want Encode to honor the
       lexical warnings configured by the "warnings" pragma, also bitwise-OR
       the check value "Encode::ONLY_PRAGMA_WARNINGS" into CHECK.  This value
       is available since Encode version 2.99.
423
424       FB_PERLQQ FB_HTMLCREF FB_XMLCREF
425
426       perlqq mode (CHECK = Encode::FB_PERLQQ)
427       HTML charref mode (CHECK = Encode::FB_HTMLCREF)
428       XML charref mode (CHECK = Encode::FB_XMLCREF)
429
       For encodings that are implemented by the "Encode::XS" module, "CHECK
       == Encode::FB_PERLQQ" puts "encode" and "decode" into "perlqq"
       fallback mode.
433
434       When you decode, "\xHH" is inserted for a malformed character, where HH
435       is the hex representation of the octet that could not be decoded to
436       utf8.  When you encode, "\x{HHHH}" will be inserted, where HHHH is the
437       Unicode code point (in any number of hex digits) of the character that
438       cannot be found in the character repertoire of the encoding.
439
440       The HTML/XML character reference modes are about the same. In place of
441       "\x{HHHH}", HTML uses "&#NNN;" where NNN is a decimal number, and XML
442       uses "&#xHHHH;" where HHHH is the hexadecimal number.
443
444       In "Encode" 2.10 or later, "LEAVE_SRC" is also implied.
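
       The three fallback modes can be compared side by side when encoding
       to ASCII.  The sample string is invented; the output comments show
       what these modes are expected to produce for it:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "caf\x{E9} \x{263A}";    # neither U+00E9 nor U+263A is in ASCII

# These modes imply LEAVE_SRC, so $string can be reused for each call.
my $perlqq = encode('ascii', $string, Encode::FB_PERLQQ);
my $html   = encode('ascii', $string, Encode::FB_HTMLCREF);
my $xml    = encode('ascii', $string, Encode::FB_XMLCREF);

print "$perlqq\n";    # caf\x{00e9} \x{263a}
print "$html\n";      # caf&#233; &#9786;
print "$xml\n";       # caf&#xe9; &#x263a;
```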
445
446       The bitmask
447
448       These modes are all actually set via a bitmask.  Here is how the
449       "FB_XXX" constants are laid out.  You can import the "FB_XXX" constants
450       via "use Encode qw(:fallbacks)", and you can import the generic bitmask
451       constants via "use Encode qw(:fallback_all)".
452
                            FB_DEFAULT FB_CROAK FB_QUIET FB_WARN  FB_PERLQQ
        DIE_ON_ERR    0x0001             X
        WARN_ON_ERR   0x0002                               X
        RETURN_ON_ERR 0x0004                      X        X
        LEAVE_SRC     0x0008                                        X
        PERLQQ        0x0100                                        X
        HTMLCREF      0x0200
        XMLCREF       0x0400
461
462       LEAVE_SRC
463
464         Encode::LEAVE_SRC
465
       If the "Encode::LEAVE_SRC" bit is not set but CHECK is set, then the
       source string to encode() or decode() will be overwritten in place.  If
       you do not want that, bitwise-OR "Encode::LEAVE_SRC" into the CHECK
       bitmask.
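
       The effect is easy to observe with two copies of the same input.  In
       this sketch (variable names are illustrative), a successful decode
       without LEAVE_SRC leaves nothing unprocessed, so the source ends up
       empty:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $consumed = "caf\xC3\xA9";    # sample UTF-8 octets
my $kept     = $consumed;

# CHECK set without LEAVE_SRC: the source is overwritten with whatever
# was left unprocessed, which is nothing after a successful decode.
decode('UTF-8', $consumed, Encode::FB_CROAK);

# CHECK set with LEAVE_SRC: the source is left alone.
decode('UTF-8', $kept, Encode::FB_CROAK | Encode::LEAVE_SRC);

printf "without LEAVE_SRC: %d octets remain\n", length $consumed;    # 0
printf "with LEAVE_SRC:    %d octets remain\n", length $kept;        # 5
```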
469
470   coderef for CHECK
471       As of "Encode" 2.12, "CHECK" can also be a code reference which takes
472       the ordinal value of the unmapped character as an argument and returns
473       octets that represent the fallback character.  For instance:
474
475         $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });
476
       Acts like "FB_PERLQQ" but "<U+XXXX>" is used instead of "\x{XXXX}".
478
       A fallback for "decode" must return a decoded string (a sequence of
       characters) and takes a list of ordinal values as its arguments.  For
       example, if you wish to decode octets as UTF-8 and use ISO-8859-15 as
       a fallback for bytes that are not valid UTF-8, you could write:
483
484           $str = decode 'UTF-8', $octets, sub {
485               my $tmp = join '', map chr, @_;
486               return decode 'ISO-8859-15', $tmp;
487           };
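
       Because the fallback is ordinary Perl code, it can also record what
       it replaced.  The following sketch for "encode" substitutes "?" and
       counts each unmapped code point; the %unmapped hash and the sample
       string are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode);

my %unmapped;    # code point => number of times it needed a fallback

my $ascii = encode('ascii', "caf\x{E9} cr\x{E8}me br\x{FB}l\x{E9}e", sub {
    my $ord = shift;      # ordinal value of the unmapped character
    $unmapped{$ord}++;
    return '?';           # octets to emit in its place
});

print "$ascii\n";    # caf? cr?me br?l?e
printf "U+%04X x %d\n", $_, $unmapped{$_} for sort { $a <=> $b } keys %unmapped;
```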
488

Defining Encodings

490       To define a new encoding, use:
491
492           use Encode qw(define_encoding);
493           define_encoding($object, CANONICAL_NAME [, alias...]);
494
495       CANONICAL_NAME will be associated with $object.  The object should
496       provide the interface described in Encode::Encoding.  If more than two
497       arguments are provided, additional arguments are considered aliases for
498       $object.
499
500       See Encode::Encoding for details.
501

The UTF8 flag

       Before the introduction of Unicode support in Perl, the "eq" operator
       just compared the strings represented by two scalars. Beginning with
       Perl 5.8, "eq" compares two strings with simultaneous consideration of
       the UTF8 flag. To explain why we made it so, I quote from page 402 of
       Programming Perl, 3rd ed.
508
509       Goal #1:
510         Old byte-oriented programs should not spontaneously break on the old
511         byte-oriented data they used to work on.
512
513       Goal #2:
514         Old byte-oriented programs should magically start working on the new
515         character-oriented data when appropriate.
516
517       Goal #3:
518         Programs should run just as fast in the new character-oriented mode
519         as in the old byte-oriented mode.
520
521       Goal #4:
522         Perl should remain one language, rather than forking into a byte-
523         oriented Perl and a character-oriented Perl.
524
       When Programming Perl, 3rd ed. was written, not even Perl 5.6.0 had
       been born yet, and many features documented in the book remained
       unimplemented for a long time.  Perl 5.8 corrected much of this, and
       the introduction of the UTF8 flag is one of those corrections.  You
       can think of there being two fundamentally different kinds of strings
       and string operations in Perl: one a byte-oriented mode for when the
       internal UTF8 flag is off, and the other a character-oriented mode for
       when the internal UTF8 flag is on.
533
534       This UTF8 flag is not visible in Perl scripts, exactly for the same
535       reason you cannot (or rather, you don't have to) see whether a scalar
536       contains a string, an integer, or a floating-point number.   But you
537       can still peek and poke these if you will.  See the next section.
538
539   Messing with Perl's Internals
       The following functions use parts of Perl's internals in the current
       implementation.  As such, they are efficient but may change in a future
       release.
543
544       is_utf8
545
546         is_utf8(STRING [, CHECK])
547
548       [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.  If
549       CHECK is true, also checks whether STRING contains well-formed UTF-8.
550       Returns true if successful, false otherwise.
551
       Typically only necessary for debugging and testing.  Don't use this
       flag as a marker to distinguish character and binary data; that should
       be decided for each variable when you write your code.
555
556       CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is
557       UTF-8 encoded and vice-versa.
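
       For debugging purposes, the flag can be inspected as in this sketch.
       The last two lines illustrate the caveat: a string can carry the flag
       without being any different, as a string, from its unflagged twin.
       Variable names and data are made up:

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $octets = "caf\xC3\xA9";               # UTF-8 octets; flag off
my $string = decode('UTF-8', $octets);    # characters; flag on

# The same one-character string, stored two ways internally.
my $plain    = "\xE9";
my $upgraded = $plain;
utf8::upgrade($upgraded);                 # flag on, content unchanged

printf "octets:   %s\n", is_utf8($octets)    ? 'on'  : 'off';    # off
printf "string:   %s\n", is_utf8($string)    ? 'on'  : 'off';    # on
printf "upgraded: %s\n", is_utf8($upgraded)  ? 'on'  : 'off';    # on
printf "equal:    %s\n", $plain eq $upgraded ? 'yes' : 'no';     # yes
```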
558
559       As of Perl 5.8.1, utf8 also has the "utf8::is_utf8" function.
560
561       _utf8_on
562
563         _utf8_on(STRING)
564
565       [INTERNAL] Turns the STRING's internal UTF8 flag on.  The STRING is not
566       checked for containing only well-formed UTF-8.  Do not use this unless
567       you know with absolute certainty that the STRING holds only well-formed
568       UTF-8.  Returns the previous state of the UTF8 flag (so please don't
569       treat the return value as indicating success or failure), or "undef" if
570       STRING is not a string.
571
572       NOTE: For security reasons, this function does not work on tainted
573       values.
574
575       _utf8_off
576
577         _utf8_off(STRING)
578
579       [INTERNAL] Turns the STRING's internal UTF8 flag off.  Do not use
580       frivolously.  Returns the previous state of the UTF8 flag, or "undef"
581       if STRING is not a string.  Do not treat the return value as indicative
582       of success or failure, because that isn't what it means: it is only the
583       previous setting.
584
585       NOTE: For security reasons, this function does not work on tainted
586       values.
587

UTF-8 vs. utf8 vs. UTF8

589         ....We now view strings not as sequences of bytes, but as sequences
590         of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
591         computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.
592
       That has historically been Perl's notion of UTF-8, as that is how UTF-8
       was first conceived by Ken Thompson when he invented it. However,
       thanks to later revisions to the applicable standards, official UTF-8
       is now rather stricter than that. For example, its range is much
       narrower (0 .. 0x10_FFFF to cover only 21 bits instead of 32 or 64
       bits) and some sequences are not allowed, like those used in surrogate
       pairs, the 32 non-character code points 0xFDD0 .. 0xFDEF, the last two
       code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non-shortest
       encodings, etc.
602
603       The former default in which Perl would always use a loose
604       interpretation of UTF-8 has now been overruled:
605
606         From: Larry Wall <larry@wall.org>
607         Date: December 04, 2004 11:51:58 JST
608         To: perl-unicode@perl.org
609         Subject: Re: Make Encode.pm support the real UTF-8
610         Message-Id: <20041204025158.GA28754@wall.org>
611
612         On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
613         : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
614         : but "UTF-8" is the name of the standard and should give the
615         : corresponding behaviour.
616
617         For what it's worth, that's how I've always kept them straight in my
618         head.
619
620         Also for what it's worth, Perl 6 will mostly default to strict but
621         make it easy to switch back to lax.
622
623         Larry
624
625       Got that?  As of Perl 5.8.7, "UTF-8" means UTF-8 in its current sense,
626       which is conservative and strict and security-conscious, whereas "utf8"
627       means UTF-8 in its former sense, which was liberal and loose and lax.
628       "Encode" version 2.10 or later thus groks this subtle but critically
629       important distinction between "UTF-8" and "utf8".
630
631         encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
632         encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
633
634       This distinction is also important for decoding. In the following, $s
635       stores character U+200000, which exceeds UTF-8's allowed range.  $s
636       thus stores an invalid Unicode code point:
637
638         $s = decode("utf8", "\xf8\x88\x80\x80\x80");
639
640       "UTF-8", by contrast, will either coerce the input to something valid:
641
642           $s = decode("UTF-8", "\xf8\x88\x80\x80\x80"); # U+FFFD
643
644       .. or croak:
645
646           decode("UTF-8", "\xf8\x88\x80\x80\x80", FB_CROAK|LEAVE_SRC);
647
648       In the "Encode" module, "UTF-8" is actually a canonical name for
649       "utf-8-strict".  That hyphen between the "UTF" and the "8" is critical;
650       without it, "Encode" goes "liberal" and (perhaps overly-)permissive:
651
652         find_encoding("UTF-8")->name # is 'utf-8-strict'
653         find_encoding("utf-8")->name # ditto. names are case insensitive
654         find_encoding("utf_8")->name # ditto. "_" are treated as "-"
655         find_encoding("UTF8")->name  # is 'utf8'.
656
657       Perl's internal UTF8 flag is called "UTF8", without a hyphen. It
658       indicates whether a string is internally encoded as "utf8", also
659       without a hyphen.
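
       The practical difference can be sketched with a code point above
       U+10FFFF.  This is an illustration under the behavior described in
       this section, not an exhaustive test of either encoding:

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

my $beyond = chr(0x200000);    # above U+10FFFF, so not a Unicode code point

# Lax "utf8" encodes it anyway.
my $lax = encode('utf8', $beyond, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Strict "UTF-8" refuses; with FB_CROAK that means an exception.
my $strict = eval {
    encode('UTF-8', $beyond, Encode::FB_CROAK | Encode::LEAVE_SRC);
};

printf "utf8:  %s\n", defined $lax    ? 'encoded' : 'failed';    # encoded
printf "UTF-8: %s\n", defined $strict ? 'encoded' : 'failed';    # failed
```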
660

SEE ALSO

       Encode::Encoding, Encode::Supported, Encode::PerlIO, encoding,
       perlebcdic, "open" in perlfunc, perlunicode, perluniintro, perlunifaq,
       perlunitut, utf8, the Perl Unicode Mailing List
       <http://lists.perl.org/list/perl-unicode.html>
666

MAINTAINER

668       This project was originated by the late Nick Ing-Simmons and later
669       maintained by Dan Kogai <dankogai@cpan.org>.  See AUTHORS for a full
670       list of people involved.  For any questions, send mail to
671       <perl-unicode@perl.org> so that we can all share.
672
673       While Dan Kogai retains the copyright as a maintainer, credit should go
674       to all those involved.  See AUTHORS for a list of those who submitted
675       code to the project.

COPYRIGHT

       Copyright 2002-2014 Dan Kogai <dankogai@cpan.org>.
679
680       This library is free software; you can redistribute it and/or modify it
681       under the same terms as Perl itself.
682
perl v5.36.0                      2023-01-20                         Encode(3)