Encode(3pm)

Encode(3)             User Contributed Perl Documentation            Encode(3)

NAME

       Encode - character encodings in Perl

SYNOPSIS

           use Encode qw(decode encode);
           $characters = decode('UTF-8', $octets,     Encode::FB_CROAK);
           $octets     = encode('UTF-8', $characters, Encode::FB_CROAK);

   Table of Contents
       Encode consists of a collection of modules whose details are too
       extensive to fit in one document.  This document explains the
       top-level APIs and general topics at a glance.  For other topics and
       more details, see the documentation for these modules:

       Encode::Alias - Alias definitions to encodings
       Encode::Encoding - Encode Implementation Base Class
       Encode::Supported - List of Supported Encodings
       Encode::CN - Simplified Chinese Encodings
       Encode::JP - Japanese Encodings
       Encode::KR - Korean Encodings
       Encode::TW - Traditional Chinese Encodings

DESCRIPTION

       The "Encode" module provides the interface between Perl strings and the
       rest of the system.  Perl strings are sequences of characters.

       The repertoire of characters that Perl can represent is a superset of
       those defined by the Unicode Consortium. On most platforms the ordinal
       value of a character as returned by "ord(S)" is the Unicode codepoint
       for that character. The exceptions are platforms where the legacy
       encoding is some variant of EBCDIC rather than a superset of ASCII; see
       perlebcdic.

       In recent history, data has been moved around a computer in 8-bit
       chunks, often called "bytes" but also known as "octets" in standards
       documents.  Perl is widely used to manipulate data of many types: not
       only strings of characters representing human or computer languages,
       but also "binary" data, being the machine's representation of numbers,
       pixels in an image, or just about anything.

       When Perl is processing "binary data", the programmer wants Perl to
       process "sequences of bytes". This is not a problem for Perl: because a
       byte has 256 possible values, it easily fits in Perl's much larger
       "logical character".

       This document mostly explains the how. perlunitut and perlunifaq
       explain the why.

   TERMINOLOGY
       character

       A character in the range 0 .. 2**32-1 (or more); what Perl's strings
       are made of.

       byte

       A character in the range 0..255; a special case of a Perl character.

       octet

       8 bits of data, with ordinal values 0..255; term for bytes passed to or
       from a non-Perl context, such as a disk file, standard I/O stream,
       database, command-line argument, environment variable, socket etc.

THE PERL ENCODING API

   Basic methods
       encode

         $octets  = encode(ENCODING, STRING[, CHECK])

       Encodes the scalar value STRING from Perl's internal form into ENCODING
       and returns a sequence of octets.  ENCODING can be either a canonical
       name or an alias.  For encoding names and aliases, see "Defining
       Aliases".  For CHECK, see "Handling Malformed Data".

       CAVEAT: the input scalar STRING might be modified in-place depending on
       what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be
       left unchanged.

       For example, to convert a string from Perl's internal format into
       ISO-8859-1, also known as Latin1:

         $octets = encode("iso-8859-1", $string);

       CAVEAT: When you run "$octets = encode("UTF-8", $string)", then $octets
       might not be equal to $string.  Though both contain the same data, the
       UTF8 flag for $octets is always off.  When you encode anything, the
       UTF8 flag on the result is always off, even when it contains a
       completely valid UTF-8 string. See "The UTF8 flag" below.

       If the $string is "undef", then "undef" is returned.

       "str2bytes" may be used as an alias for "encode".

       decode

         $string = decode(ENCODING, OCTETS[, CHECK])

       This function returns the string that results from decoding the scalar
       value OCTETS, assumed to be a sequence of octets in ENCODING, into
       Perl's internal form.  As with encode(), ENCODING can be either a
       canonical name or an alias. For encoding names and aliases, see
       "Defining Aliases"; for CHECK, see "Handling Malformed Data".

       CAVEAT: the input scalar OCTETS might be modified in-place depending on
       what is set in CHECK. See "LEAVE_SRC" if you want your inputs to be
       left unchanged.

       For example, to convert ISO-8859-1 data into a string in Perl's
       internal format:

         $string = decode("iso-8859-1", $octets);

       CAVEAT: When you run "$string = decode("UTF-8", $octets)", then $string
       might not be equal to $octets.  Though both contain the same data, the
       UTF8 flag for $string is on.  See "The UTF8 flag" below.

       If $octets is "undef", then "undef" is returned.

       "bytes2str" may be used as an alias for "decode".

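       As a minimal sketch of the two functions working together, the
       following round-trips a short string through two encodings.  The
       sample string and variable names are illustrative assumptions, not
       part of the module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A four-character string; "\x{e9}" is LATIN SMALL LETTER E WITH ACUTE.
my $string = "caf\x{e9}";

# The same characters serialized in two different encodings.
my $latin1 = encode("iso-8859-1", $string);   # one octet per character here
my $utf8   = encode("UTF-8",      $string);   # U+00E9 needs two octets

printf "characters: %d, latin1 octets: %d, utf8 octets: %d\n",
    length($string), length($latin1), length($utf8);

# Decoding either octet sequence yields the original characters again.
my $from_latin1 = decode("iso-8859-1", $latin1);
my $from_utf8   = decode("UTF-8",      $utf8);
print "round trip ok\n"
    if $from_latin1 eq $string && $from_utf8 eq $string;
```

       No CHECK argument is passed here, so the inputs are left untouched.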
       find_encoding

         [$obj =] find_encoding(ENCODING)

       Returns the encoding object corresponding to ENCODING.  Returns "undef"
       if no matching ENCODING is found.  The returned object is what does the
       actual encoding or decoding.

         $string = decode($name, $bytes);

       is in fact

           $string = do {
               $obj = find_encoding($name);
               croak qq(encoding "$name" not found) unless ref $obj;
               $obj->decode($bytes);
           };

       with more error checking.

       You can therefore save time by reusing this object as follows:

           my $enc = find_encoding("iso-8859-1");
           while(<>) {
               my $string = $enc->decode($_);
               ... # now do something with $string;
           }

       Besides "decode" and "encode", other methods are available as well.
       For instance, "name()" returns the canonical name of the encoding
       object.

         find_encoding("latin1")->name; # iso-8859-1

       See Encode::Encoding for details.

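       A self-contained sketch of that pattern, including the "undef" check
       for an unknown name, might look like this.  The input list and the
       name "no-such-encoding" are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $name = "latin1";
my $enc  = find_encoding($name);
die qq(encoding "$name" not found) unless ref $enc;
print "canonical name: ", $enc->name, "\n";

# Hypothetical ISO-8859-1 input; the object is looked up once and reused.
my @lines   = ("caf\xe9", "na\xefve");
my @decoded = map { $enc->decode($_) } @lines;
print scalar(@decoded), " lines decoded\n";

# An unknown name yields undef rather than an exception.
my $missing = find_encoding("no-such-encoding");
print "no-such-encoding: not found\n" unless defined $missing;
```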
       find_mime_encoding

         [$obj =] find_mime_encoding(MIME_ENCODING)

       Returns the encoding object corresponding to MIME_ENCODING.  Acts the
       same as "find_encoding()", except that the "mime_name()" of the
       returned object must match MIME_ENCODING.  Unlike "find_encoding()",
       canonical names and aliases are not used when searching for the object.

           find_mime_encoding("utf8");         # undef: "utf8" is not a valid MIME_ENCODING
           find_mime_encoding("utf-8");        # encoding object "utf-8-strict"
           find_mime_encoding("UTF-8");        # same as "utf-8": MIME_ENCODING is case insensitive
           find_mime_encoding("utf-8-strict"); # undef: "utf-8-strict" is not a valid MIME_ENCODING

       from_to

         [$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])

       Converts data in place between two encodings. The data in $octets must
       be encoded as octets and not as characters in Perl's internal format.
       For example, to convert ISO-8859-1 data into Microsoft's CP1250
       encoding:

         from_to($octets, "iso-8859-1", "cp1250");

       and to convert it back:

         from_to($octets, "cp1250", "iso-8859-1");

       Because the conversion happens in place, the data to be converted
       cannot be a string constant: it must be a scalar variable.

       "from_to()" returns the length of the converted string in octets on
       success, and "undef" on error.

       CAVEAT: The following operations may look the same, but are not:

         from_to($data, "iso-8859-1", "UTF-8"); #1
         $data = decode("iso-8859-1", $data);  #2

       Both #1 and #2 make $data consist of a completely valid UTF-8 string,
       but only #2 turns the UTF8 flag on.  #1 is equivalent to:

         $data = encode("UTF-8", decode("iso-8859-1", $data));

       See "The UTF8 flag" below.

       Also note that:

         from_to($octets, $from, $to, $check);

       is equivalent to:

         $octets = encode($to, decode($from, $octets), $check);

       Note that $check is not applied during decoding; this is deliberate.
       If you need finer control, use "decode" followed by "encode" as
       follows:

         $octets = encode($to, decode($from, $octets, $check_from), $check_to);

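       The in-place behaviour and the return value can be seen in a small
       sketch such as the following.  The sample data is an assumption;
       note that "from_to" has to be imported explicitly.

```perl
use strict;
use warnings;
use Encode qw(from_to encode decode);

my $octets = "caf\xe9";      # "cafe" with e-acute, as ISO-8859-1 octets
my $copy   = $octets;        # keep the original for comparison

# Converts $octets in place and returns the new length in octets.
my $length = from_to($octets, "iso-8859-1", "UTF-8");
printf "converted to %d octets\n", $length;

# The explicit two-step form produces the same octets.
my $two_step = encode("UTF-8", decode("iso-8859-1", $copy));
print "same result\n" if $two_step eq $octets;
```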
       encode_utf8

         $octets = encode_utf8($string);

       WARNING: This function can produce invalid UTF-8!  Do not use it for
       data exchange.  Unless you want Perl's older "lax" mode, prefer
       "$octets = encode("UTF-8", $string)".

       Equivalent to "$octets = encode("utf8", $string)".  The characters in
       $string are encoded in Perl's internal format, and the result is
       returned as a sequence of octets.  Because all possible characters in
       Perl have a (loose, not strict) utf8 representation, this function
       cannot fail.

       decode_utf8

         $string = decode_utf8($octets [, CHECK]);

       WARNING: This function accepts invalid UTF-8!  Do not use it for data
       exchange.  Unless you want Perl's older "lax" mode, prefer "$string =
       decode("UTF-8", $octets [, CHECK])".

       Equivalent to "$string = decode("utf8", $octets [, CHECK])".  The
       sequence of octets represented by $octets is decoded from (loose, not
       strict) utf8 into a sequence of logical characters.  Because not all
       sequences of octets are valid utf8, even in this loose sense, it is
       quite possible for this function to fail.  For CHECK, see "Handling
       Malformed Data".

       CAVEAT: the input $octets might be modified in-place depending on what
       is set in CHECK. See "LEAVE_SRC" if you want your inputs to be left
       unchanged.

   Listing available encodings
         use Encode;
         @list = Encode->encodings();

       Returns a list of canonical names of available encodings that have
       already been loaded.  To get a list of all available encodings
       including those that have not yet been loaded, say:

         @all_encodings = Encode->encodings(":all");

       Or you can give the name of a specific module:

         @with_jp = Encode->encodings("Encode::JP");

       When "::" is not in the name, "Encode::" is assumed.

         @ebcdic = Encode->encodings("EBCDIC");

       To find out in detail which encodings are supported by this package,
       see Encode::Supported.

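       One possible use of these lists is to check up front whether the
       encodings a program needs are present.  In this sketch the set of
       names being checked is an arbitrary assumption.

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();         # already loaded
my @all    = Encode->encodings(":all");   # loaded plus loadable on demand

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;

# Build a lookup table of canonical names and probe a few of them.
my %available = map { $_ => 1 } @all;
for my $name (qw(utf-8-strict iso-8859-1 euc-jp)) {
    printf "%-14s %s\n", $name, $available{$name} ? "available" : "missing";
}
```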
   Defining Aliases
       To add a new alias to a given encoding, use:

         use Encode;
         use Encode::Alias;
         define_alias(NEWNAME => ENCODING);

       After that, NEWNAME can be used as an alias for ENCODING.  ENCODING may
       be either the name of an encoding or an encoding object.

       Before you do that, first make sure the alias is nonexistent using
       "resolve_alias()", which returns the canonical name thereof.  For
       example:

         Encode::resolve_alias("latin1") eq "iso-8859-1" # true
         Encode::resolve_alias("iso-8859-12")   # false; nonexistent
         Encode::resolve_alias($name) eq $name  # true if $name is canonical

       "resolve_alias()" does not need "use Encode::Alias"; it can be imported
       via "use Encode qw(resolve_alias)".

       See Encode::Alias for details.

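       Put together, the check-then-define sequence could be sketched as
       follows.  The alias name "westeuro" is hypothetical and chosen only
       because it is assumed not to exist already.

```perl
use strict;
use warnings;
use Encode qw(resolve_alias decode);
use Encode::Alias;    # exports define_alias

my $alias = "westeuro";    # hypothetical alias name

# Only define the alias if the name does not resolve to anything yet.
if (my $existing = resolve_alias($alias)) {
    warn "$alias already resolves to $existing\n";
}
else {
    define_alias($alias => "iso-8859-1");
}

my $canonical = resolve_alias($alias);
print "$alias => $canonical\n";

# The alias can now be used wherever an encoding name is expected.
my $string = decode($alias, "caf\xe9");
```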
   Finding IANA Character Set Registry names
       The canonical name of a given encoding does not necessarily agree with
       the name in the IANA Character Set Registry, commonly seen as
       "Content-Type: text/plain; charset=WHATEVER".  For most cases, the
       canonical name works, but sometimes it does not, most notably with
       "utf-8-strict".

       As of "Encode" version 2.21, the "mime_name()" method is therefore
       available.

         use Encode;
         my $enc = find_encoding("UTF-8");
         warn $enc->name;      # utf-8-strict
         warn $enc->mime_name; # UTF-8

       See also:  Encode::Encoding

Encoding via PerlIO

       If your perl supports "PerlIO" (which is the default), you can use a
       "PerlIO" layer to decode and encode directly via a filehandle.  The
       following two examples are fully identical in functionality:

         ### Version 1 via PerlIO
           open(INPUT,  "< :encoding(shiftjis)", $infile)
               || die "Can't open < $infile for reading: $!";
           open(OUTPUT, "> :encoding(euc-jp)",  $outfile)
               || die "Can't open > $outfile for writing: $!";
           while (<INPUT>) {   # auto decodes $_
               print OUTPUT;   # auto encodes $_
           }
           close(INPUT)   || die "can't close $infile: $!";
           close(OUTPUT)  || die "can't close $outfile: $!";

         ### Version 2 via from_to()
           use Encode qw(from_to);   # from_to is not exported by default
           open(INPUT,  "< :raw", $infile)
               || die "Can't open < $infile for reading: $!";
           open(OUTPUT, "> :raw",  $outfile)
               || die "Can't open > $outfile for writing: $!";

           while (<INPUT>) {
               from_to($_, "shiftjis", "euc-jp", 1);  # switch encoding
               print OUTPUT;   # emit raw (but properly encoded) data
           }
           close(INPUT)   || die "can't close $infile: $!";
           close(OUTPUT)  || die "can't close $outfile: $!";

       In the first version above, you let the appropriate encoding layer
       handle the conversion.  In the second, you explicitly translate from
       one encoding to the other.

       Unfortunately, some encodings may not be "PerlIO"-savvy.  You can
       check whether your encoding is supported by "PerlIO" by invoking the
       "perlio_ok" method on it:

         Encode::perlio_ok("hz");             # false
         find_encoding("euc-cn")->perlio_ok;  # true wherever PerlIO is available

         use Encode qw(perlio_ok);            # imported upon request
         perlio_ok("euc-jp");

       Fortunately, all encodings that come with "Encode" core are
       "PerlIO"-savvy except for "hz" and "ISO-2022-kr".  For the gory
       details, see Encode::Encoding and Encode::PerlIO.

Handling Malformed Data

       The optional CHECK argument tells "Encode" what to do when encountering
       malformed data.  Without CHECK, "Encode::FB_DEFAULT" (== 0) is assumed.

       As of version 2.12, "Encode" supports coderef values for "CHECK"; see
       below.

       NOTE: Not all encodings support this feature.  Some encodings ignore
       the CHECK argument.  For example, Encode::Unicode ignores CHECK and it
       always croaks on error.

   List of CHECK values
       FB_DEFAULT

         CHECK = Encode::FB_DEFAULT ( == 0)

       If CHECK is 0, encoding and decoding replace any malformed character
       with a substitution character.  When you encode, SUBCHAR is used.  When
       you decode, the Unicode REPLACEMENT CHARACTER, code point U+FFFD, is
       used.  If the data is supposed to be UTF-8, an optional lexical warning
       of warning category "utf8" is given.

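       A short sketch of the default substitution behaviour in both
       directions follows.  The inputs are illustrative; for the "ascii"
       encoding the substitution character is assumed to be "?".

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encoding: U+00E9 has no ASCII mapping, so SUBCHAR is emitted instead.
my $octets = encode("ascii", "caf\x{e9}");
print "encoded: $octets\n";

# Decoding: the octet 0xFF can never appear in UTF-8, so it is replaced
# by U+FFFD REPLACEMENT CHARACTER.
my $string = decode("UTF-8", "abc\xff");
printf "last code point: U+%04X\n", ord(substr($string, -1));
```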
       FB_CROAK

         CHECK = Encode::FB_CROAK ( == 1)

       If CHECK is 1, methods immediately die with an error message.
       Therefore, when CHECK is 1, you should trap exceptions with "eval{}",
       unless you really want to let it "die".

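       Trapping the exception could be sketched like this.  The malformed
       input is made up, and "LEAVE_SRC" (described below) is added so that
       the input variable is not touched.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

my $octets = "abc\xff";    # 0xFF is never valid in UTF-8

# eval returns undef if decode() dies; the message is then in $@.
my $string = eval { decode("UTF-8", $octets, FB_CROAK | LEAVE_SRC) };

if (defined $string) {
    print "decoded ", length($string), " characters\n";
}
else {
    print "decode failed: $@";
}
```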
       FB_QUIET

         CHECK = Encode::FB_QUIET

       If CHECK is set to "Encode::FB_QUIET", encoding and decoding
       immediately return the portion of the data that has been processed so
       far when an error occurs. The data argument is overwritten with
       everything after that point; that is, the unprocessed portion of the
       data.  This is handy when you have to call "decode" repeatedly in the
       case where your source data may contain partial multi-byte character
       sequences (that is, you are reading with a fixed-width buffer). Here's
       some sample code to do exactly that:

           my($buffer, $string) = ("", "");
           while (read($fh, $buffer, 256, length($buffer))) {
               $string .= decode($encoding, $buffer, Encode::FB_QUIET);
               # $buffer now contains the unprocessed partial character
           }

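       The same loop can be tried without a filehandle.  In this sketch the
       two chunks are a hypothetical stand-in for successive "read" calls
       that happen to split a two-octet UTF-8 character.

```perl
use strict;
use warnings;
use Encode qw(decode FB_QUIET);

# UTF-8 octets for "cafe" with e-acute plus "!", split mid-character.
my @chunks = ("caf\xc3", "\xa9!");

my ($buffer, $string) = ("", "");
for my $chunk (@chunks) {
    $buffer .= $chunk;
    # Decodes as much as possible; the undecoded tail stays in $buffer.
    $string .= decode("UTF-8", $buffer, FB_QUIET);
    printf "decoded so far: %d characters, %d octets pending\n",
        length($string), length($buffer);
}
```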
       FB_WARN

         CHECK = Encode::FB_WARN

       This is the same as "FB_QUIET" above, except that instead of being
       silent on errors, it issues a warning.  This is handy for when you are
       debugging.

       CAVEAT: All warnings from the Encode module are reported independently
       of the warnings pragma settings. If you want Encode to follow the
       lexical warnings configured by the warnings pragma, also add the check
       value "Encode::ONLY_PRAGMA_WARNINGS". This value is available since
       Encode version 2.99.

       FB_PERLQQ FB_HTMLCREF FB_XMLCREF

       perlqq mode (CHECK = Encode::FB_PERLQQ)
       HTML charref mode (CHECK = Encode::FB_HTMLCREF)
       XML charref mode (CHECK = Encode::FB_XMLCREF)

       For encodings that are implemented by the "Encode::XS" module, "CHECK
       == Encode::FB_PERLQQ" puts "encode" and "decode" into "perlqq"
       fallback mode.

       When you decode, "\xHH" is inserted for a malformed character, where HH
       is the hex representation of the octet that could not be decoded to
       utf8.  When you encode, "\x{HHHH}" will be inserted, where HHHH is the
       Unicode code point (in any number of hex digits) of the character that
       cannot be found in the character repertoire of the encoding.

       The HTML/XML character reference modes are about the same. In place of
       "\x{HHHH}", HTML uses "&#NNN;" where NNN is a decimal number, and XML
       uses "&#xHHHH;" where HHHH is the hexadecimal number.

       In "Encode" 2.10 or later, "LEAVE_SRC" is also implied.

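       To compare the three fallback notations side by side, a sketch such
       as the following can be used.  The sample string is an assumption,
       and "ascii" is used only because it cannot represent U+00E9.

```perl
use strict;
use warnings;
use Encode qw(encode FB_PERLQQ FB_HTMLCREF FB_XMLCREF);

my $string = "caf\x{e9}";    # U+00E9 is outside the ASCII repertoire

# Each mode replaces the unmappable character with a different notation.
my $perlqq = encode("ascii", $string, FB_PERLQQ);
my $html   = encode("ascii", $string, FB_HTMLCREF);
my $xml    = encode("ascii", $string, FB_XMLCREF);

print "perlqq: $perlqq\n";
print "html:   $html\n";
print "xml:    $xml\n";
```

       Because these modes imply "LEAVE_SRC", $string can be reused for all
       three calls.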
       The bitmask

       These modes are all actually set via a bitmask.  Here is how the
       "FB_XXX" constants are laid out.  You can import the "FB_XXX" constants
       via "use Encode qw(:fallbacks)", and you can import the generic bitmask
       constants via "use Encode qw(:fallback_all)".

                            FB_DEFAULT FB_CROAK FB_QUIET FB_WARN  FB_PERLQQ
        DIE_ON_ERR    0x0001             X
        WARN_ON_ERR   0x0002                               X
        RETURN_ON_ERR 0x0004                      X        X
        LEAVE_SRC     0x0008                                        X
        PERLQQ        0x0100                                        X
        HTMLCREF      0x0200
        XMLCREF       0x0400

       LEAVE_SRC

         Encode::LEAVE_SRC

       If the "Encode::LEAVE_SRC" bit is not set but CHECK is set, then the
       source string to encode() or decode() will be overwritten in place.  If
       you do not want the source overwritten, bitwise-OR "Encode::LEAVE_SRC"
       into CHECK.

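       The difference is easy to observe with two copies of the same
       input.  This is only a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

my $consumed = "caf\xc3\xa9";    # valid UTF-8, five octets
my $kept     = "caf\xc3\xa9";

# CHECK set without LEAVE_SRC: the processed octets are removed from the source.
my $first  = decode("UTF-8", $consumed, FB_CROAK);

# CHECK set with LEAVE_SRC: the source is left as it was.
my $second = decode("UTF-8", $kept, FB_CROAK | LEAVE_SRC);

printf "without LEAVE_SRC: %d octets left in source\n", length $consumed;
printf "with LEAVE_SRC:    %d octets left in source\n", length $kept;
```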
   coderef for CHECK
       As of "Encode" 2.12, "CHECK" can also be a code reference which takes
       the ordinal value of the unmapped character as an argument and returns
       octets that represent the fallback character.  For instance:

         $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });

       Acts like "FB_PERLQQ" but "<U+XXXX>" is used instead of "\x{XXXX}".

       A fallback for "decode" must return a decoded string (a sequence of
       characters) and takes a list of ordinal values as its arguments. So,
       for example, if you wish to decode octets as UTF-8 and use ISO-8859-15
       as a fallback for bytes that are not valid UTF-8, you could write:

           $str = decode 'UTF-8', $octets, sub {
               my $tmp = join '', map chr, @_;
               return decode 'ISO-8859-15', $tmp;
           };

Defining Encodings

       To define a new encoding, use:

           use Encode qw(define_encoding);
           define_encoding($object, CANONICAL_NAME [, alias...]);

       CANONICAL_NAME will be associated with $object.  The object should
       provide the interface described in Encode::Encoding.  If more than two
       arguments are provided, additional arguments are considered aliases for
       $object.

       See Encode::Encoding for details.

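       As a rough sketch only, a toy encoding could be registered as shown
       below.  The class "My::Rot13" and the names "x-rot13" and "rot13" are
       hypothetical, and the object implements just the "encode" and
       "decode" methods; a real encoding should follow Encode::Encoding
       more closely, including CHECK handling.

```perl
use strict;
use warnings;

# Hypothetical encoding class: ROT13 over ASCII letters, for illustration.
package My::Rot13 {
    use parent qw(Encode::Encoding);

    sub _rot13 {
        my ($text) = @_;
        $text =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        return $text;
    }

    # Both methods are called as ($self, $data, $check); $check is ignored here.
    sub encode { my ($self, $string) = @_; return _rot13($string) }
    sub decode { my ($self, $octets) = @_; return _rot13($octets) }
}

package main;
use Encode qw(define_encoding encode decode);

# Associate the object with a canonical name and one alias.
define_encoding(bless({ Name => "x-rot13" }, "My::Rot13"), "x-rot13", "rot13");

my $octets = encode("x-rot13", "Hello");
my $string = decode("rot13", $octets);
print "$octets -> $string\n";
```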

The UTF8 flag

       Before the introduction of Unicode support in Perl, the "eq" operator
       just compared the strings represented by two scalars. Beginning with
       Perl 5.8, "eq" compares two strings with simultaneous consideration of
       the UTF8 flag. To explain why we made it so, I quote from page 402 of
       Programming Perl, 3rd ed.

       Goal #1:
         Old byte-oriented programs should not spontaneously break on the old
         byte-oriented data they used to work on.

       Goal #2:
         Old byte-oriented programs should magically start working on the new
         character-oriented data when appropriate.

       Goal #3:
         Programs should run just as fast in the new character-oriented mode
         as in the old byte-oriented mode.

       Goal #4:
         Perl should remain one language, rather than forking into a
         byte-oriented Perl and a character-oriented Perl.

       When Programming Perl, 3rd ed. was written, not even Perl 5.6.0 had
       been born yet, and many features documented in the book remained
       unimplemented for a long time.  Perl 5.8 corrected much of this, and
       the introduction of the UTF8 flag is one of them.  You can think of
       there being two fundamentally different kinds of strings and
       string-operations in Perl: one a byte-oriented mode for when the
       internal UTF8 flag is off, and the other a character-oriented mode for
       when the internal UTF8 flag is on.

       This UTF8 flag is not visible in Perl scripts, exactly for the same
       reason you cannot (or rather, you don't have to) see whether a scalar
       contains a string, an integer, or a floating-point number.  But you
       can still peek and poke these if you will.  See the next section.

   Messing with Perl's Internals
       The following API uses parts of Perl's internals in the current
       implementation.  As such, they are efficient but may change in a future
       release.

       is_utf8

         is_utf8(STRING [, CHECK])

       [INTERNAL] Tests whether the UTF8 flag is turned on in the STRING.  If
       CHECK is true, also checks whether STRING contains well-formed UTF-8.
       Returns true if successful, false otherwise.

       Typically only necessary for debugging and testing.  Don't use this
       flag as a marker to distinguish character and binary data; that should
       be decided for each variable when you write your code.

       CAVEAT: If STRING has UTF8 flag set, it does NOT mean that STRING is
       UTF-8 encoded and vice-versa.

       As of Perl 5.8.1, utf8 also has the "utf8::is_utf8" function.

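       For debugging, the flag can be inspected before and after a
       decode/encode round trip, as in this sketch.  The sample octets are
       an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

my $octets = "caf\xc3\xa9";                # raw UTF-8 octets, flag off
my $string = decode("UTF-8", $octets);     # characters, flag on
my $again  = encode("UTF-8", $string);     # octets again, flag off

for my $pair (["octets", $octets], ["string", $string], ["re-encoded", $again]) {
    my ($label, $value) = @$pair;
    printf "%-10s length %d, UTF8 flag %s\n",
        $label, length($value), is_utf8($value) ? "on" : "off";
}
```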
       _utf8_on

         _utf8_on(STRING)

       [INTERNAL] Turns the STRING's internal UTF8 flag on.  The STRING is not
       checked for containing only well-formed UTF-8.  Do not use this unless
       you know with absolute certainty that the STRING holds only well-formed
       UTF-8.  Returns the previous state of the UTF8 flag (so please don't
       treat the return value as indicating success or failure), or "undef" if
       STRING is not a string.

       NOTE: For security reasons, this function does not work on tainted
       values.

       _utf8_off

         _utf8_off(STRING)

       [INTERNAL] Turns the STRING's internal UTF8 flag off.  Do not use
       frivolously.  Returns the previous state of the UTF8 flag, or "undef"
       if STRING is not a string.  Do not treat the return value as indicative
       of success or failure, because that isn't what it means: it is only the
       previous setting.

       NOTE: For security reasons, this function does not work on tainted
       values.

UTF-8 vs. utf8 vs. UTF8

         ....We now view strings not as sequences of bytes, but as sequences
         of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
         computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

       That has historically been Perl's notion of UTF-8, as that is how UTF-8
       was first conceived by Ken Thompson when he invented it. However,
       thanks to later revisions to the applicable standards, official UTF-8
       is now rather stricter than that. For example, its range is much
       narrower (0 .. 0x10_FFFF to cover only 21 bits instead of 32 or 64
       bits) and some sequences are not allowed, like those used in surrogate
       pairs, the 32 non-character code points 0xFDD0 .. 0xFDEF, the last two
       code points in any plane (0xXX_FFFE and 0xXX_FFFF), all non-shortest
       encodings, etc.

       The former default in which Perl would always use a loose
       interpretation of UTF-8 has now been overruled:

         From: Larry Wall <larry@wall.org>
         Date: December 04, 2004 11:51:58 JST
         To: perl-unicode@perl.org
         Subject: Re: Make Encode.pm support the real UTF-8
         Message-Id: <20041204025158.GA28754@wall.org>

         On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
         : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
         : but "UTF-8" is the name of the standard and should give the
         : corresponding behaviour.

         For what it's worth, that's how I've always kept them straight in my
         head.

         Also for what it's worth, Perl 6 will mostly default to strict but
         make it easy to switch back to lax.

         Larry

       Got that?  As of Perl 5.8.7, "UTF-8" means UTF-8 in its current sense,
       which is conservative and strict and security-conscious, whereas "utf8"
       means UTF-8 in its former sense, which was liberal and loose and lax.
       "Encode" version 2.10 or later thus groks this subtle but critically
       important distinction between "UTF-8" and "utf8".

         encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
         encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

       This distinction is also important for decoding. In the following, $s
       stores character U+200000, which exceeds UTF-8's allowed range.  $s
       thus stores an invalid Unicode code point:

         $s = decode("utf8", "\xf8\x88\x80\x80\x80");

       "UTF-8", by contrast, will either coerce the input to something valid:

           $s = decode("UTF-8", "\xf8\x88\x80\x80\x80"); # U+FFFD

       .. or croak:

           decode("UTF-8", "\xf8\x88\x80\x80\x80", FB_CROAK|LEAVE_SRC);

       In the "Encode" module, "UTF-8" is actually an alias whose canonical
       name is "utf-8-strict".  That hyphen between the "UTF" and the "8" is
       critical; without it, "Encode" goes "liberal" and (perhaps
       overly-)permissive:

         find_encoding("UTF-8")->name # is 'utf-8-strict'
         find_encoding("utf-8")->name # ditto. names are case insensitive
         find_encoding("utf_8")->name # ditto. "_" are treated as "-"
         find_encoding("UTF8")->name  # is 'utf8'.

       Perl's internal UTF8 flag is called "UTF8", without a hyphen. It
       indicates whether a string is internally encoded as "utf8", also
       without a hyphen.

MAINTAINER

       This project was originated by the late Nick Ing-Simmons and later
       maintained by Dan Kogai <dankogai@cpan.org>.  See AUTHORS for a full
       list of people involved.  For any questions, send mail to
       <perl-unicode@perl.org> so that we can all share.

       While Dan Kogai retains the copyright as a maintainer, credit should go
       to all those involved.  See AUTHORS for a list of those who submitted
       code to the project.

COPYRIGHT

       Copyright 2002-2014 Dan Kogai <dankogai@cpan.org>.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.34.1                      2022-04-07                         Encode(3)