Encode(3pm)            Perl Programmers Reference Guide            Encode(3pm)

NAME

Encode - character encodings

SYNOPSIS

    use Encode;

Table of Contents

Encode consists of a collection of modules whose details are too
extensive to fit in one document. This POD itself explains the
top-level APIs and general topics at a glance. For other topics and
more details, see the PODs below:

    Name              Description
    --------------------------------------------------------
    Encode::Alias     Alias definitions to encodings
    Encode::Encoding  Encode Implementation Base Class
    Encode::Supported List of Supported Encodings
    Encode::CN        Simplified Chinese Encodings
    Encode::JP        Japanese Encodings
    Encode::KR        Korean Encodings
    Encode::TW        Traditional Chinese Encodings
    --------------------------------------------------------

DESCRIPTION

The "Encode" module provides the interfaces between Perl's strings and
the rest of the system. Perl strings are sequences of characters.

The repertoire of characters that Perl can represent is at least that
defined by the Unicode Consortium. On most platforms the ordinal value
of a character (as returned by "ord(ch)") is the Unicode code point
for that character. The exceptions are platforms where the legacy
encoding is some variant of EBCDIC rather than a superset of ASCII;
see perlebcdic.

Traditionally, computer data has been moved around in 8-bit chunks
often called "bytes". These chunks are also known as "octets" in
networking standards. Perl is widely used to manipulate data of many
types: not only strings of characters representing human or computer
languages but also "binary" data, the machine's representation of
numbers, pixels in an image, or just about anything.

When Perl is processing "binary data", the programmer wants Perl to
process "sequences of bytes". This is not a problem for Perl: because
a byte has 256 possible values, it easily fits in Perl's much larger
"logical character".

TERMINOLOGY

· character: a character in the range 0..(2**32-1) (or more). (What
  Perl's strings are made of.)

· byte: a character in the range 0..255. (A special case of a Perl
  character.)

· octet: 8 bits of data, with ordinal values 0..255. (Term for bytes
  passed to or from a non-Perl context, e.g. a disk file.)
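
The distinction between characters and octets can be seen by comparing
a string with its encoded form. The following is a minimal sketch; the
sample character and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A one-character string: U+263A WHITE SMILING FACE.
my $string = "\x{263A}";

# Encoding turns characters into octets; UTF-8 needs three octets here.
my $octets = encode("UTF-8", $string);

my $char_count  = length $string;    # counts characters
my $octet_count = length $octets;    # counts octets, each in 0..255

printf "%d character, %d octets: %s\n",
    $char_count, $octet_count,
    join " ", map { sprintf "%02X", ord($_) } split //, $octets;
```

length() counts characters for $string but octets for $octets, because
the encoded result is a sequence of bytes in the sense defined above.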

PERL ENCODING API

$octets = encode(ENCODING, $string [, CHECK])
Encodes a string from Perl's internal form into ENCODING and returns
a sequence of octets. ENCODING can be either a canonical name or an
alias. For encoding names and aliases, see "Defining Aliases". For
CHECK, see "Handling Malformed Data".

For example, to convert a string from Perl's internal format to
iso-8859-1 (also known as Latin1):

    $octets = encode("iso-8859-1", $string);

CAVEAT: When you run "$octets = encode("utf8", $string)", $octets may
not be equal to $string. Though they both contain the same data, the
utf8 flag for $octets is always off. When you encode anything, the
utf8 flag of the result is always off, even when it contains a
completely valid utf8 string. See "The UTF-8 flag" below.

If $string is "undef", "undef" is returned.
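
As an illustration of encode(), the sketch below encodes the same
string into two encodings and inspects the results. The sample string
and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode is_utf8);

my $string = "caf\x{E9}";    # four characters; the last is U+00E9

my $latin1 = encode("iso-8859-1", $string);    # one octet per character
my $utf8   = encode("UTF-8",      $string);    # U+00E9 becomes two octets

print length($latin1), " octets as ISO-8859-1\n";
print length($utf8),   " octets as UTF-8\n";
print "utf8 flag on the result: ", (is_utf8($utf8) ? "on" : "off"), "\n";

# An undefined input yields an undefined result.
my $nothing = encode("UTF-8", undef);
```

The octet counts differ because the two encodings represent U+00E9
differently, while the utf8 flag of both results is off.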

$string = decode(ENCODING, $octets [, CHECK])
Decodes a sequence of octets assumed to be in ENCODING into Perl's
internal form and returns the resulting string. As in encode(),
ENCODING can be either a canonical name or an alias. For encoding
names and aliases, see "Defining Aliases". For CHECK, see "Handling
Malformed Data".

For example, to convert ISO-8859-1 data to a string in Perl's
internal format:

    $string = decode("iso-8859-1", $octets);

CAVEAT: When you run "$string = decode("utf8", $octets)", $string may
not be equal to $octets. Though they both contain the same data, the
utf8 flag for $string is on unless $octets consists entirely of ASCII
data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.

If $octets is "undef", "undef" is returned.
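
To illustrate decode(), the following sketch decodes the same word from
two different octet sequences. The sample data and variable names are
assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $latin1_octets = "caf\xE9";        # "cafe" with e-acute, ISO-8859-1
my $utf8_octets   = "caf\xC3\xA9";    # the same word, UTF-8

my $from_latin1 = decode("iso-8859-1", $latin1_octets);
my $from_utf8   = decode("UTF-8",      $utf8_octets);

print "same string: ", ($from_latin1 eq $from_utf8 ? "yes" : "no"), "\n";
print "characters:  ", length($from_utf8), "\n";
```

Both calls produce the same four-character string, although the inputs
differ in length.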

[$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])
Converts data between two encodings in place. The data in $octets
must be encoded as octets and not as characters in Perl's internal
format. For example, to convert ISO-8859-1 data to Microsoft's CP1250
encoding:

    from_to($octets, "iso-8859-1", "cp1250");

and to convert it back:

    from_to($octets, "cp1250", "iso-8859-1");

Note that because the conversion happens in place, the data to be
converted cannot be a string constant; it must be a scalar variable.

from_to() returns the length of the converted string in octets on
success, undef on error.

CAVEAT: The following operations look the same but are not quite:

    from_to($data, "iso-8859-1", "utf8"); #1
    $data = decode("iso-8859-1", $data);  #2

Both #1 and #2 make $data consist of a completely valid UTF-8 string
but only #2 turns the utf8 flag on. #1 is equivalent to

    $data = encode("utf8", decode("iso-8859-1", $data));

See "The UTF-8 flag" below.
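
The difference described in the CAVEAT can be checked directly. This
is a minimal sketch; the sample data and variable names are
illustrative.

```perl
use strict;
use warnings;
use Encode qw(from_to decode encode is_utf8);

# 1: in-place conversion; the result is still a sequence of octets.
my $data   = "caf\xE9";    # ISO-8859-1 octets
my $length = from_to($data, "iso-8859-1", "UTF-8");

# 2: decoding produces a character string instead.
my $chars = decode("iso-8859-1", "caf\xE9");

print "from_to: $length octets, utf8 flag ",
    (is_utf8($data) ? "on" : "off"), "\n";
print "decode:  ", length($chars), " characters\n";

# Encoding the decoded string gives the same octets that from_to produced.
my $same = (encode("UTF-8", $chars) eq $data) ? 1 : 0;
```

from_to() reports five octets, while decode() yields four characters.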

$octets = encode_utf8($string);
Equivalent to "$octets = encode("utf8", $string);". The characters
that comprise $string are encoded in Perl's internal format and the
result is returned as a sequence of octets. All possible characters
have a UTF-8 representation so this function cannot fail.

$string = decode_utf8($octets [, CHECK]);
Equivalent to "$string = decode("utf8", $octets [, CHECK])". The
sequence of octets represented by $octets is decoded from UTF-8 into
a sequence of logical characters. Not all sequences of octets form
valid UTF-8 encodings, so it is possible for this call to fail. For
CHECK, see "Handling Malformed Data".
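
A round trip through encode_utf8() and decode_utf8(), together with one
failing decode, might look like the following sketch. The sample data
is an assumption; the malformed input is simply ISO-8859-1 text that is
not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $string     = "caf\x{E9} \x{263A}";
my $octets     = encode_utf8($string);     # cannot fail
my $round_trip = decode_utf8($octets);

# ISO-8859-1 octets are not valid UTF-8 here, so with FB_CROAK the
# call dies; eval traps the error.
my $bad     = "caf\xE9 au lait";
my $decoded = eval { decode_utf8($bad, Encode::FB_CROAK) };
my $failed  = defined $decoded ? 0 : 1;

print length($octets), " octets, round trip ",
    ($round_trip eq $string ? "ok" : "differs"), "\n";
print "malformed input: ", ($failed ? "rejected" : "accepted"), "\n";
```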

Listing available encodings

    use Encode;
    @list = Encode->encodings();

Returns a list of the canonical names of the available encodings that
are loaded. To get a list of all available encodings including the
ones that are not loaded yet, say

    @all_encodings = Encode->encodings(":all");

Or you can give the name of a specific module.

    @with_jp = Encode->encodings("Encode::JP");

When "::" is not in the name, "Encode::" is assumed.

    @ebcdic = Encode->encodings("EBCDIC");

To find out in detail which encodings are supported by this package,
see Encode::Supported.
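
The calls above can be combined into a small survey of what is
available. This is only a sketch; the variable names are illustrative
and the actual lists depend on the installation.

```perl
use strict;
use warnings;
use Encode;

my @loaded = Encode->encodings();               # loaded right now
my @all    = Encode->encodings(":all");         # loaded or loadable
my @jp     = Encode->encodings("Encode::JP");   # including Encode::JP's

printf "%d loaded, %d available in total\n", scalar @loaded, scalar @all;
print "loaded: @loaded\n";
print "euc-jp is ", ((grep { $_ eq "euc-jp" } @jp) ? "" : "not "),
    "offered by Encode::JP\n";
```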

Defining Aliases

To add a new alias to a given encoding, use:

    use Encode;
    use Encode::Alias;
    define_alias(newName => ENCODING);

After that, newName can be used as an alias for ENCODING. ENCODING may
be either the name of an encoding or an encoding object.

Before you do so, make sure the alias does not already exist by
calling "resolve_alias()", which returns the canonical name of an
existing alias. For example:

    Encode::resolve_alias("latin1") eq "iso-8859-1" # true
    Encode::resolve_alias("iso-8859-12")   # false; nonexistent
    Encode::resolve_alias($name) eq $name  # true if $name is canonical

resolve_alias() does not need "use Encode::Alias"; it can be exported
via "use Encode qw(resolve_alias)".

See Encode::Alias for details.
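
Checking for an existing alias before defining one can be sketched as
follows. The alias name "x-legacy" is hypothetical and chosen only for
this example.

```perl
use strict;
use warnings;
use Encode qw(encode resolve_alias);
use Encode::Alias;    # exports define_alias

my $alias = "x-legacy";    # hypothetical alias name

# resolve_alias() returns false when the name is unknown.
my $existed_before = resolve_alias($alias) ? 1 : 0;
define_alias($alias => "iso-8859-1") unless $existed_before;

my $canonical = resolve_alias($alias);
my $octets    = encode($alias, "caf\x{E9}");

print "$alias resolves to $canonical\n";
```

After the definition the alias can be used wherever an encoding name is
accepted, as the encode() call shows.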

Encoding via PerlIO

If your perl supports PerlIO (which is the default), you can use a
PerlIO layer to decode and encode directly via a filehandle. The
following two examples are identical in functionality.

    # via PerlIO
    open my $in,  "<:encoding(shiftjis)", $infile  or die;
    open my $out, ">:encoding(euc-jp)",   $outfile or die;
    while(<$in>){ print $out $_; }

    # via from_to
    open my $in,  "<", $infile  or die;
    open my $out, ">", $outfile or die;
    while(<$in>){
        from_to($_, "shiftjis", "euc-jp", 1);
        print $out $_;
    }

Unfortunately, not all encodings are PerlIO-savvy. You can check
whether your encoding is supported by PerlIO by calling the
"perlio_ok" method.

    Encode::perlio_ok("hz");             # False
    find_encoding("euc-cn")->perlio_ok;  # True where PerlIO is available

    use Encode qw(perlio_ok);            # exported upon request
    perlio_ok("euc-jp")

Fortunately, all encodings that come with Encode core are PerlIO-savvy
except for hz and ISO-2022-kr. For gory details, see Encode::Encoding
and Encode::PerlIO.
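
The equivalence of the two approaches can be tried without touching the
disk by using in-memory filehandles. This is a sketch under that
assumption; the sample data and the choice of ISO-8859-1 and UTF-8 are
illustrative.

```perl
use strict;
use warnings;
use Encode qw(from_to perlio_ok);

my $latin1 = "caf\xE9\n";    # ISO-8859-1 octets

# via PerlIO layers, reading from and writing to in-memory buffers
my $via_perlio = '';
open my $in,  "<:encoding(iso-8859-1)", \$latin1     or die "open: $!";
open my $out, ">:encoding(UTF-8)",      \$via_perlio or die "open: $!";
while (my $line = <$in>) { print {$out} $line }
close $in;
close $out or die "close: $!";

# via from_to, converting a copy of the same octets in place
my $via_from_to = $latin1;
from_to($via_from_to, "iso-8859-1", "UTF-8");

print "results are ",
    ($via_perlio eq $via_from_to ? "identical" : "different"), "\n";
print "iso-8859-1 is ", (perlio_ok("iso-8859-1") ? "" : "not "),
    "PerlIO-savvy\n";
```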

Handling Malformed Data

The optional CHECK argument tells Encode what to do when it encounters
malformed data. Without CHECK, Encode::FB_DEFAULT ( == 0 ) is assumed.

As of version 2.12 Encode supports coderef values for CHECK. See
below.

NOTE: Not all encodings support this feature
Some encodings ignore the CHECK argument. For example,
Encode::Unicode ignores CHECK and always croaks on error.

Now here is the list of CHECK values available:

CHECK = Encode::FB_DEFAULT ( == 0)
If CHECK is 0, (en|de)code will put a substitution character in place
of a malformed character. When you encode, <subchar> will be used.
When you decode, the code point 0xFFFD is used. If the data is
supposed to be UTF-8, an optional lexical warning (category utf8) is
given.

CHECK = Encode::FB_CROAK ( == 1)
If CHECK is 1, methods will die on error immediately with an error
message. Therefore, when CHECK is set to 1, you should trap the
error with eval{} unless you really want to let it die.

CHECK = Encode::FB_QUIET
If CHECK is set to Encode::FB_QUIET, (en|de)code will immediately
return the portion of the data that has been processed so far when an
error occurs. The data argument will be overwritten with everything
after that point (that is, the unprocessed part of the data). This is
handy when you have to call decode repeatedly because your source data
may contain partial multi-byte character sequences (for example, when
you are reading with a fixed-width buffer). Here is sample code that
does exactly this:

    my $buffer = ''; my $string = '';
    while(read $fh, $buffer, 256, length($buffer)){
        $string .= decode($encoding, $buffer, Encode::FB_QUIET);
        # $buffer now contains the unprocessed partial character
    }

CHECK = Encode::FB_WARN
This is the same as FB_QUIET above, except that it warns on error.
Handy when you are debugging the mode above.

perlqq mode (CHECK = Encode::FB_PERLQQ)
HTML charref mode (CHECK = Encode::FB_HTMLCREF)
XML charref mode (CHECK = Encode::FB_XMLCREF)
For encodings that are implemented by Encode::XS, CHECK ==
Encode::FB_PERLQQ turns (en|de)code into "perlqq" fallback mode.

When you decode, "\xHH" will be inserted for a malformed character,
where HH is the hex representation of the octet that could not be
decoded to utf8. When you encode, "\x{HHHH}" will be inserted, where
HHHH is the Unicode code point of the character that cannot be found
in the character repertoire of the encoding.

The HTML/XML character reference modes are about the same: in place of
"\x{HHHH}", HTML uses "&#NNN;" where NNN is a decimal number and XML
uses "&#xHHHH;" where HHHH is the hexadecimal number.

In Encode 2.10 or later, "LEAVE_SRC" is also implied.

The bitmask
These modes are actually set via a bitmask. Here is how the FB_XX
constants are laid out. You can import the FB_XX constants via "use
Encode qw(:fallbacks)"; you can import the generic bitmask constants
via "use Encode qw(:fallback_all)".

                         FB_DEFAULT FB_CROAK FB_QUIET FB_WARN FB_PERLQQ
    DIE_ON_ERR    0x0001               X
    WARN_ON_ERR   0x0002                                 X
    RETURN_ON_ERR 0x0004                        X        X
    LEAVE_SRC     0x0008                                          X
    PERLQQ        0x0100                                          X
    HTMLCREF      0x0200
    XMLCREF       0x0400
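
The CHECK values described above can be compared side by side. This is
a minimal sketch: the sample string and variable names are assumptions,
and the constants are imported with the ":fallback_all" tag mentioned
above.

```perl
use strict;
use warnings;
use Encode qw(encode :fallback_all);

# U+263A has no mapping in ISO-8859-1, so encoding it exercises CHECK.
my $string = "caf\x{E9} \x{263A}";

# FB_DEFAULT (0): a substitution character replaces the unmapped one.
my $default = encode("iso-8859-1", $string);

# FB_CROAK: encode() dies, so trap it with eval. LEAVE_SRC is OR-ed in
# so that $string is left untouched for the calls that follow.
my $result  = eval { encode("iso-8859-1", $string, FB_CROAK | LEAVE_SRC) };
my $croaked = defined $result ? 0 : 1;

# The three escaping modes already imply LEAVE_SRC (Encode 2.10 or later).
my $perlqq = encode("iso-8859-1", $string, FB_PERLQQ);      # \x{HHHH}
my $html   = encode("iso-8859-1", $string, FB_HTMLCREF);    # &#NNN;
my $xml    = encode("iso-8859-1", $string, FB_XMLCREF);     # &#xHHHH;

# The FB_XX constants are combinations of the generic bitmask flags.
my $warn_is_combined   = FB_WARN   == (RETURN_ON_ERR | WARN_ON_ERR);
my $perlqq_is_combined = FB_PERLQQ == (PERLQQ | LEAVE_SRC);

print "croaked: $croaked\n";
print "HTML:    $html\n";
print "XML:     $xml\n";
```

The last two comparisons restate rows of the bitmask table in code.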

coderef for CHECK

As of Encode 2.12, CHECK can also be a code reference which takes the
ord value of the unmapped character as an argument and returns a
string that represents the fallback character. For instance,

    $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });

acts like FB_PERLQQ but <U+XXXX> is used instead of \x{XXXX}.
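
A complete version of the coderef fallback might look like this. It is
a sketch only; the sample string is an assumption.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "caf\x{E9} \x{263A}";    # two characters outside ASCII

# The callback receives the ordinal of each unmapped character and
# returns the text to put in its place.
my $ascii = encode("ascii", $string, sub { sprintf "<U+%04X>", shift });

print "$ascii\n";
```

Each character that ASCII cannot represent is replaced by the string
the callback returns, so the result contains only ASCII octets.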

Defining Encodings

To define a new encoding, use:

    use Encode qw(define_encoding);
    define_encoding($object, 'canonicalName' [, alias...]);

canonicalName will be associated with $object. The object should
provide the interface described in Encode::Encoding. If more than two
arguments are provided then additional arguments are taken as aliases
for $object.

See Encode::Encoding for more details.
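
As a rough illustration, a toy encoding can be registered as follows.
Everything specific here is hypothetical: the class My::ROT13, the
names "x-rot13" and "x-rot-thirteen", and the choice of ROT13 itself.
The sketch assumes that an object inheriting from Encode::Encoding,
with a Name key and its own encode() and decode() methods, is enough
for this purpose; see Encode::Encoding for the full interface.

```perl
use strict;
use warnings;
use Encode qw(define_encoding encode decode find_encoding);
use Encode::Encoding;

{
    # Hypothetical encoding class; the name is illustrative only.
    package My::ROT13;
    our @ISA = ('Encode::Encoding');

    # ROT13 is its own inverse, so one routine serves both directions.
    sub encode {
        my ($self, $string, $check) = @_;
        (my $result = $string) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
        return $result;
    }
    sub decode { my $self = shift; return $self->encode(@_) }
}

# The inherited name() method reads the Name key of the object.
my $object = bless { Name => 'x-rot13' }, 'My::ROT13';
define_encoding($object, 'x-rot13', 'x-rot-thirteen');

my $octets = encode('x-rot13', 'Hello, Encode');
my $string = decode('x-rot-thirteen', $octets);    # via the alias

print "$octets\n$string\n";
```

The third argument to define_encoding() registers an alias, which the
decode() call uses.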

The UTF-8 flag

Before the introduction of utf8 support in perl, the "eq" operator
just compared the strings represented by two scalars. Beginning with
perl 5.8, "eq" compares two strings with simultaneous consideration of
the utf8 flag. To explain why we made it so, I will quote page 402 of
"Programming Perl, 3rd ed."

Goal #1:
Old byte-oriented programs should not spontaneously break on the old
byte-oriented data they used to work on.

Goal #2:
Old byte-oriented programs should magically start working on the new
character-oriented data when appropriate.

Goal #3:
Programs should run just as fast in the new character-oriented mode
as in the old byte-oriented mode.

Goal #4:
Perl should remain one language, rather than forking into a
byte-oriented Perl and a character-oriented Perl.

Back when "Programming Perl, 3rd ed." was written, not even Perl 5.6.0
had been released, and many features documented in the book remained
unimplemented for a long time. Perl 5.8 corrected this, and the
introduction of the UTF-8 flag is one of those corrections. You can
think of this as perl's notion of a byte-oriented mode (utf8 flag off)
and a character-oriented mode (utf8 flag on).

Here is how Encode takes care of the utf8 flag.

· When you encode, the resulting utf8 flag is always off.

· When you decode, the resulting utf8 flag is on unless the data can
  be represented unambiguously without it. Here is what
  "unambiguously" means.

  After "$utf8 = decode('foo', $octet);",

    When $octet is...                 The utf8 flag in $utf8 is
    ---------------------------------------------
    In ASCII only (or EBCDIC only)    OFF
    In ISO-8859-1                     ON
    In any other Encoding             ON
    ---------------------------------------------

  As you see, there is one exception: ASCII-only data. That way you
  can assume Goal #1. With Encode, Goal #2 is also assumed, but you
  still have to be careful in the cases mentioned in the CAVEAT
  paragraphs above.

  This utf8 flag is not visible in perl scripts, for exactly the same
  reason you cannot (or do not have to) see whether a scalar contains
  a string, an integer, or a floating point number. But you can still
  peek and poke these if you wish. See the section below.
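
The two unconditional rules can be observed with is_utf8(), which is
described in the next section. This is a sketch with an illustrative
sample character; the ASCII-only row of the table is deliberately not
exercised here.

```perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

my $octets = encode("UTF-8", "\x{263A}");    # encoding: flag off
my $string = decode("UTF-8", $octets);       # decoding non-ASCII: flag on

print "octets: flag ", (is_utf8($octets) ? "on" : "off"),
    ", length ", length($octets), "\n";
print "string: flag ", (is_utf8($string) ? "on" : "off"),
    ", length ", length($string), "\n";

# The two scalars hold the same data in different modes, so "eq"
# reports them as different.
my $equal = ($octets eq $string) ? 1 : 0;
```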

Messing with Perl's Internals

The following APIs use parts of Perl's internals in the current
implementation. As such, they are efficient but may change.

is_utf8(STRING [, CHECK])
[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.

As of perl 5.8.1, utf8 also has utf8::is_utf8().

_utf8_on(STRING)
[INTERNAL] Turns on the UTF-8 flag in STRING. The data in STRING is
not checked for being well-formed UTF-8. Do not use unless you know
that the STRING is well-formed UTF-8. Returns the previous state of
the UTF-8 flag (so please don't treat the return value as indicating
success or failure), or "undef" if STRING is not a string.

_utf8_off(STRING)
[INTERNAL] Turns off the UTF-8 flag in STRING. Do not use
frivolously. Returns the previous state of the UTF-8 flag (so please
don't treat the return value as indicating success or failure), or
"undef" if STRING is not a string.
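
For experimentation only, the effect of these functions on a string can
be sketched as follows. The sample character and variable names are
illustrative; as noted above, these functions rely on internals and
should not be used frivolously.

```perl
use strict;
use warnings;
use Encode qw(is_utf8 _utf8_on _utf8_off);

my $string = "\x{263A}";    # one character, UTF-8 flag on
my $before = is_utf8($string) ? 1 : 0;

# Turning the flag off exposes the internal octets as separate characters.
my $previous  = _utf8_off($string);    # previous state of the flag
my $as_octets = length $string;

# Turning it back on is acceptable here only because the octets are
# known to be well-formed UTF-8.
_utf8_on($string);
my $as_chars    = length $string;
my $well_formed = is_utf8($string, 1) ? 1 : 0;

print "flag off: $as_octets, flag on: $as_chars\n";
```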

UTF-8 vs. utf8

    ....We now view strings not as sequences of bytes, but as sequences
    of numbers in the range 0 .. 2**32-1 (or in the case of 64-bit
    computers, 0 .. 2**64-1) -- Programming Perl, 3rd ed.

That has been perl's notion of UTF-8, but official UTF-8 is more
strict: its range is much narrower (0 .. 10FFFF) and some sequences
are not allowed (e.g. those used in surrogate pairs, 0xFFFE, et al).

Now that is overruled by Larry Wall himself.

    From: Larry Wall <larry@wall.org>
    Date: December 04, 2004 11:51:58 JST
    To: perl-unicode@perl.org
    Subject: Re: Make Encode.pm support the real UTF-8
    Message-Id: <20041204025158.GA28754@wall.org>

    On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
    : I've no problem with 'utf8' being perl's unrestricted uft8 encoding,
    : but "UTF-8" is the name of the standard and should give the
    : corresponding behaviour.

    For what it's worth, that's how I've always kept them straight in my
    head.

    Also for what it's worth, Perl 6 will mostly default to strict but
    make it easy to switch back to lax.

    Larry

Do you copy? As of Perl 5.8.7, "UTF-8" means strict, official UTF-8
while "utf8" means the liberal, lax version thereof. Encode version
2.10 or later thus groks the difference between "UTF-8" and "utf8".

    encode("utf8",  "\x{FFFF_FFFF}", 1); # okay
    encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks

"UTF-8" in Encode is actually an alias for the canonical name
"utf-8-strict". Yes, the hyphen between "UTF" and "8" is important.
Without it Encode goes "liberal".

    find_encoding("UTF-8")->name # is 'utf-8-strict'
    find_encoding("utf-8")->name # ditto. names are case insensitive
    find_encoding("utf_8")->name # ditto. "_" are treated as "-"
    find_encoding("UTF8")->name  # is 'utf8'.
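
The strict and lax behaviours can be compared with a code point just
beyond the Unicode range. This is a sketch; the choice of 0x110000 and
the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the sample code point is deliberately non-Unicode
use Encode qw(encode find_encoding :fallback_all);

my $beyond = chr(0x110000);    # one past the last Unicode code point

# Lax "utf8" accepts anything Perl itself can represent.
my $lax = encode("utf8", $beyond, FB_CROAK | LEAVE_SRC);

# Strict "UTF-8" refuses it; with FB_CROAK the refusal is an exception.
my $strict = eval { encode("UTF-8", $beyond, FB_CROAK | LEAVE_SRC) };
my $error  = $@;

print "lax:    ", length($lax), " octets\n";
print "strict: ", (defined $strict ? "accepted" : "croaked"), "\n";
print find_encoding("UTF-8")->name, " vs. ",
    find_encoding("utf8")->name, "\n";
```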

SEE ALSO

Encode::Encoding, Encode::Supported, Encode::PerlIO, encoding,
perlebcdic, "open" in perlfunc, perlunicode, utf8, the Perl Unicode
Mailing List <perl-unicode@perl.org>

MAINTAINER

This project was originated by Nick Ing-Simmons and later maintained
by Dan Kogai <dankogai@dan.co.jp>. See AUTHORS for a full list of
people involved. For any questions, use <perl-unicode@perl.org> so we
can all share.

perl v5.8.8                       2001-09-21                       Encode(3pm)