MIME::Charset(3)      User Contributed Perl Documentation     MIME::Charset(3)

NAME
       MIME::Charset - Charset Information for MIME

SYNOPSIS
           use MIME::Charset;

           $charset = MIME::Charset->new("euc-jp");

       Getting charset information:

           $benc = $charset->body_encoding; # e.g. "Q"
           $cset = $charset->as_string; # e.g. "US-ASCII"
           $henc = $charset->header_encoding; # e.g. "S"
           $cset = $charset->output_charset; # e.g. "ISO-2022-JP"

       Translating text data:

           ($text, $charset, $encoding) =
               $charset->header_encode(
                  "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
                  "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
                  Charset => 'euc-jp');
           # ...returns e.g. (<converted>, "ISO-2022-JP", "B").

           ($text, $charset, $encoding) =
               $charset->body_encode(
                   "Collectioneur path\xe9tiquement ".
                   "\xe9clectique de d\xe9chets",
                   Charset => 'latin1');
           # ...returns e.g. (<original>, "ISO-8859-1", "QUOTED-PRINTABLE").

           $len = $charset->encoded_header_len(
               "Perl\xe8\xa8\x80\xe8\xaa\x9e",
               Charset => 'utf-8',
               Encoding => "b");
           # ...returns e.g. 28.

       Manipulating module defaults:

           MIME::Charset::alias("csEUCKR", "euc-kr");
           MIME::Charset::default("iso-8859-1");
           MIME::Charset::fallback("us-ascii");

       Non-OO functions (may be deprecated in the near future):

           use MIME::Charset qw(:info);

           $benc = body_encoding("iso-8859-2"); # "Q"
           $cset = canonical_charset("ANSI X3.4-1968"); # "US-ASCII"
           $henc = header_encoding("utf-8"); # "S"
           $cset = output_charset("shift_jis"); # "ISO-2022-JP"

           use MIME::Charset qw(:trans);

           ($text, $charset, $encoding) =
               header_encode(
                  "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
                  "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
                  "euc-jp");
           # ...returns (<converted>, "ISO-2022-JP", "B");

           ($text, $charset, $encoding) =
               body_encode(
                   "Collectioneur path\xe9tiquement ".
                   "\xe9clectique de d\xe9chets",
                   "latin1");
           # ...returns (<original>, "ISO-8859-1", "QUOTED-PRINTABLE");

           $len = encoded_header_len(
               "Perl\xe8\xa8\x80\xe8\xaa\x9e", "b", "utf-8"); # 28

DESCRIPTION
       MIME::Charset provides information about character sets used for MIME
       messages on the Internet.

   Definitions
       The charset is the ``character set'' used in MIME to refer to a method
       of converting a sequence of octets into a sequence of characters.  It
       includes both concepts of ``coded character set'' (CCS) and ``character
       encoding scheme'' (CES) of ISO/IEC.

       The encoding is the term used in MIME to refer to a method of
       representing a body part or a header body as sequence(s) of printable
       US-ASCII characters.

   Constructor
       $charset = MIME::Charset->new([CHARSET [, OPTS]])
           Create a charset object.

           OPTS may contain the following key-value pair.  NOTE: When
           Unicode/multibyte support is disabled (see "USE_ENCODE"),
           conversion will not be performed, so this option has no effect.

           Mapping => MAPTYPE
               Whether to extend the mappings actually used for charset names
               or not.  "EXTENDED" uses extended mappings.  "STANDARD" uses
               standardized strict mappings.  Default is "EXTENDED".
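
           As a rough illustration of the constructor, the sketch below
           builds two objects for the same charset, one with the default
           mappings and one with Mapping => "STANDARD", plus one object
           created without CHARSET.  The variable names are only
           illustrative, and the sketch assumes the module is installed.

```perl
use strict;
use warnings;
use MIME::Charset;

# Same charset, two mapping types.  Variable names are illustrative.
my $extended = MIME::Charset->new("euc-jp");
my $standard = MIME::Charset->new("euc-jp", Mapping => "STANDARD");

# new() without CHARSET also returns an object (see "Incompatible
# Changes", release 1.001).
my $empty = MIME::Charset->new;

# The mapping type affects conversion tables, not the charset name.
my $ext_name = $extended->as_string;
my $std_name = $standard->as_string;

print "extended: $ext_name\n";
print "standard: $std_name\n";
print "object without CHARSET: ", (ref $empty ? "yes" : "no"), "\n";
```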

   Getting Information of Charsets
       $charset->body_encoding
       body_encoding CHARSET
           Get the recommended transfer-encoding of CHARSET for message body.

           The returned value will be one of "B" (BASE64), "Q"
           (QUOTED-PRINTABLE), "S" (the shorter of the two) or "undef" (might
           not be transfer-encoded; either 7BIT or 8BIT).  This may differ
           from the encoding for message header.

       $charset->as_string
       canonical_charset CHARSET
           Get the canonical name for charset.

       $charset->decoder
           Get the "Encode::Encoding" object used to decode strings to
           Unicode by charset.  If charset is not specified or not known by
           this module, undef will be returned.

       $charset->dup
           Get a copy of the charset object.

       $charset->encoder([CHARSET])
           Get the "Encode::Encoding" object used to encode Unicode strings
           using the compatible charset recommended for messages on the
           Internet.

           If the optional CHARSET is specified, the encoder (and output
           charset name) of the $charset object are replaced with those of
           CHARSET, so the $charset object becomes a converter between the
           original charset and the new CHARSET.

       $charset->header_encoding
       header_encoding CHARSET
           Get the recommended encoding scheme of CHARSET for message header.

           The returned value will be one of "B", "Q", "S" (the shorter of the
           two) or "undef" (might not be encoded).  This may differ from the
           encoding for message body.

       $charset->output_charset
       output_charset CHARSET
           Get a charset which is compatible with the given CHARSET and is
           recommended for MIME messages on the Internet (if it is known by
           this module).

           When Unicode/multibyte support is disabled (see "USE_ENCODE"), this
           function will simply return the result of "canonical_charset".
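
           Because the encoding accessors may return "undef", callers
           usually need to check for it before printing or comparing.  The
           following sketch shows one way to do that; describe_encoding()
           is a hypothetical helper written for this example and is not
           part of the module.

```perl
use strict;
use warnings;
use MIME::Charset;

# Hypothetical helper (not part of MIME::Charset): turn an encoding
# code into a readable label, treating undef as "not encoded".
sub describe_encoding {
    my ($code) = @_;
    return "not encoded" unless defined $code;
    my %label = (
        B => "BASE64",
        Q => "QUOTED-PRINTABLE",
        S => "shorter of B and Q",
    );
    my $key = uc $code;
    return exists $label{$key} ? $label{$key} : "unknown ($code)";
}

my @report;
for my $name ("iso-8859-2", "utf-8", "shift_jis") {
    my $cs = MIME::Charset->new($name);
    push @report, sprintf("%-12s header: %-20s body: %s",
        $cs->as_string,
        describe_encoding($cs->header_encoding),
        describe_encoding($cs->body_encoding));
}
print "$_\n" for @report;
```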

   Translating Text Data
       $charset->body_encode(STRING [, OPTS])
       body_encode STRING, CHARSET [, OPTS]
           Get the converted (if needed) data of STRING and the recommended
           transfer-encoding of that data for message body.  CHARSET is the
           charset by which STRING is encoded.

           OPTS may contain the following key-value pairs.  NOTE: When
           Unicode/multibyte support is disabled (see "USE_ENCODE"),
           conversion will not be performed, so these options have no effect.

           Detect7bit => YESNO
               Try auto-detecting a 7-bit charset when CHARSET is not given.
               Default is "YES".

           Replacement => REPLACEMENT
               Specifies the error handling scheme.  See "Error Handling".

           A 3-item list of (converted string, charset for output,
           transfer-encoding) will be returned.  The transfer-encoding will
           be one of "BASE64", "QUOTED-PRINTABLE", "7BIT" or "8BIT".  If the
           charset for output could not be determined and the converted
           string contains non-ASCII byte(s), the charset for output will be
           "undef" and the transfer-encoding will be "BASE64".  The charset
           for output will be "US-ASCII" if and only if the string does not
           contain any non-ASCII bytes.
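
           The returned string is converted but not yet transfer-encoded;
           the third item only names the recommended encoding.  The sketch
           below applies that recommendation with the core modules
           MIME::QuotedPrint and MIME::Base64 and builds the matching header
           fields.  It is one possible way to use the result, not something
           the module does itself.

```perl
use strict;
use warnings;
use MIME::Charset;
use MIME::QuotedPrint qw(encode_qp);
use MIME::Base64 qw(encode_base64);

my $latin1 = MIME::Charset->new("latin1");
my $bytes  = "\xe9clectique de d\xe9chets";   # ISO-8859-1 octets

my ($text, $out_charset, $cte) = $latin1->body_encode($bytes);

# Apply the recommended transfer-encoding ourselves.
my $body = $cte eq "QUOTED-PRINTABLE" ? encode_qp($text)
         : $cte eq "BASE64"           ? encode_base64($text)
         :                              $text;   # 7BIT or 8BIT

# $out_charset may be undef when no output charset could be determined.
my @fields;
push @fields, "Content-Type: text/plain; charset=$out_charset"
    if defined $out_charset;
push @fields, "Content-Transfer-Encoding: $cte";

print "$_\n" for @fields;
print "\n$body\n";
```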

       $charset->decode(STRING [,CHECK])
           Decode STRING to Unicode.

           Note: When Unicode/multibyte support is disabled (see
           "USE_ENCODE"), this function will die.

       detect_7bit_charset STRING
           Guess the 7-bit charset that may encode the string STRING.  If
           STRING contains any 8-bit bytes, "undef" will be returned.
           Otherwise, the default charset (see "default") will be returned
           when the charset cannot be determined.

       $charset->encode(STRING [, CHECK])
           Encode STRING (Unicode or non-Unicode) using the compatible charset
           recommended for messages on the Internet (if this module knows
           it).  Note that the string will be decoded to Unicode and then
           encoded even if the compatible charset is equal to the original
           charset.

           Note: When Unicode/multibyte support is disabled (see
           "USE_ENCODE"), this function will die.

       $charset->encoded_header_len(STRING [, ENCODING])
       encoded_header_len STRING, ENCODING, CHARSET
           Get the length of the encoded STRING for message header (without
           folding).

           ENCODING may be one of "B", "Q" or "S" (the shorter of "B" and
           "Q").
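
           To make the returned length concrete, the following sketch
           compares it with a hand-built encoded-word for the SYNOPSIS
           string.  The hand-built form ("=?" charset "?B?" data "?=") is
           only a cross-check for this one case and assumes "B" encoding
           with UTF-8 as the output charset.

```perl
use strict;
use warnings;
use MIME::Charset;
use MIME::Base64 qw(encode_base64);

my $utf8  = MIME::Charset->new("utf-8");
my $bytes = "Perl\xe8\xa8\x80\xe8\xaa\x9e";   # 10 UTF-8 octets

# Length as reported by the module for "B" encoding.
my $len = $utf8->encoded_header_len($bytes, "B");

# Cross-check by hand (illustration only): one unfolded encoded-word.
my $by_hand = length("=?UTF-8?B?" . encode_base64($bytes, "") . "?=");

print "module: $len, by hand: $by_hand\n";
```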

       $charset->header_encode(STRING [, OPTS])
       header_encode STRING, CHARSET [, OPTS]
           Get the converted (if needed) data of STRING and the recommended
           encoding scheme of that data for message headers.  CHARSET is the
           charset by which STRING is encoded.

           OPTS may contain the following key-value pairs.  NOTE: When
           Unicode/multibyte support is disabled (see "USE_ENCODE"),
           conversion will not be performed, so these options have no effect.

           Detect7bit => YESNO
               Try auto-detecting a 7-bit charset when CHARSET is not given.
               Default is "YES".

           Replacement => REPLACEMENT
               Specifies the error handling scheme.  See "Error Handling".

           A 3-item list of (converted string, charset for output, encoding
           scheme) will be returned.  The encoding scheme will be one of "B",
           "Q" or "undef" (might not be encoded).  If the charset for output
           could not be determined and the converted string contains
           non-ASCII byte(s), the charset for output will be "8BIT" (this is
           not a charset name but a special value representing unencodable
           data) and the encoding scheme will be "undef" (should not be
           encoded).  The charset for output will be "US-ASCII" if and only
           if the string does not contain any non-ASCII bytes.
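
           The three returned items are the pieces needed to build an
           encoded-word.  The sketch below assembles a single, unfolded
           encoded-word for the "B" case only; "Q" encoding, line folding
           and the length limits of RFC 2047 are deliberately left out, so
           treat it as an illustration rather than a complete header
           encoder.

```perl
use strict;
use warnings;
use MIME::Charset;
use MIME::Base64 qw(encode_base64);

my $eucjp = MIME::Charset->new("euc-jp");
my $bytes = "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa"
          . "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef";   # EUC-JP octets

my ($text, $out_charset, $scheme) = $eucjp->header_encode($bytes);

my $word;
if (!defined $scheme) {
    # Plain US-ASCII, or the special "8BIT" marker: do not encode.
    $word = $text;
} elsif ($scheme eq "B") {
    $word = "=?$out_charset?B?" . encode_base64($text, "") . "?=";
}
# "Q" is not handled in this sketch, so $word stays undef for it.

print defined $word ? "$word\n" : "Q encoding is not shown here\n";
```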

       $charset->undecode(STRING [,CHECK])
           Encode the Unicode string STRING to a byte string using the input
           charset of $charset.  This is equivalent to
           "$charset->decoder->encode()".

           Note: When Unicode/multibyte support is disabled (see
           "USE_ENCODE"), this function will die.
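
           Since decode(), encode() and undecode() die when
           Unicode/multibyte support is disabled, a caller may want to test
           the "USE_ENCODE" constant first.  A minimal sketch of a guarded
           round trip, with illustrative variable names:

```perl
use strict;
use warnings;
use MIME::Charset;

my $utf8  = MIME::Charset->new("utf-8");
my $bytes = "Perl\xe8\xa8\x80\xe8\xaa\x9e";   # 10 UTF-8 octets

my ($chars, $back);
if (MIME::Charset::USE_ENCODE()) {
    $chars = $utf8->decode($bytes);     # octets -> Unicode string
    $back  = $utf8->undecode($chars);   # Unicode string -> octets
    printf "%d octets -> %d characters -> %d octets\n",
        length($bytes), length($chars), length($back);
} else {
    print "Unicode/multibyte support is disabled; skipping conversion\n";
}
```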

   Manipulating Module Defaults
       alias ALIAS [, CHARSET]
           Get/set a charset alias for canonical names determined by
           "canonical_charset".

           If CHARSET is given and isn't false, ALIAS will be assigned as an
           alias of CHARSET.  Otherwise, the alias won't be changed.  In both
           cases, the charset name that ALIAS is currently assigned to will
           be returned.

       default [CHARSET]
           Get/set the default charset.

           The default charset is used by this module when the charset
           context is unknown.  Modules using this module are recommended to
           use this charset when the charset context is unknown or an
           implicit default is expected.  By default, it is "US-ASCII".

           If CHARSET is given and isn't false, it will be set as the default
           charset.  Otherwise, the default charset won't be changed.  In
           both cases, the current default charset will be returned.

           NOTE: The default charset should not be changed.

       fallback [CHARSET]
           Get/set the fallback charset.

           The fallback charset is used by this module when conversion by the
           given charset has failed and the "FALLBACK" error handling scheme
           is specified.  Modules using this module may use this charset as a
           last resort for conversion.  By default, it is "UTF-8".

           If CHARSET is given and isn't false, it will be set as the
           fallback charset.  If CHARSET is "NONE", the fallback charset will
           be undefined.  Otherwise, the fallback charset won't be changed.
           In any case, the current fallback charset will be returned.

           NOTE: It is useful to specify "US-ASCII" as the fallback charset,
           since the result of conversion will then be readable without
           charset information.
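
           These settings are module-wide, so code that changes them
           temporarily may want to restore the previous value afterwards.
           The sketch below shows the get/set pattern for fallback() and
           alias(); restoring through "NONE" when nothing was set before is
           a choice made for this example.

```perl
use strict;
use warnings;
use MIME::Charset;

# Without CHARSET the function only reports the current value.
my $saved_fallback = MIME::Charset::fallback();

# Set a new fallback; the current (new) value is returned.
my $now = MIME::Charset::fallback("us-ascii");

# Register an alias, then look it up again.
MIME::Charset::alias("csEUCKR", "euc-kr");
my $target = MIME::Charset::alias("csEUCKR");

print "fallback: $now, csEUCKR -> $target\n";

# Put the previous fallback back so other code is not affected.
MIME::Charset::fallback(defined $saved_fallback ? $saved_fallback : "NONE");
```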

       recommended CHARSET [, HEADERENC, BODYENC [, ENCCHARSET]]
           Get/set charset profiles.

           If optional arguments are given and any of them is not false, the
           profiles for CHARSET will be set by those arguments.  Otherwise,
           the profiles won't be changed.  In both cases, the current
           profiles for CHARSET will be returned as a 3-item list of
           (HEADERENC, BODYENC, ENCCHARSET).

           HEADERENC is the recommended encoding scheme for message header.
           It may be one of "B", "Q", "S" (the shorter of the two) or "undef"
           (might not be encoded).

           BODYENC is the recommended transfer-encoding for message body.  It
           may be one of "B", "Q", "S" (the shorter of the two) or "undef"
           (might not be transfer-encoded).

           ENCCHARSET is a charset which is compatible with the given CHARSET
           and is recommended for MIME messages on the Internet.  If
           conversion is not needed (or this module doesn't know an
           appropriate charset), ENCCHARSET is "undef".

           NOTE: Future releases of this function may accept more optional
           arguments (for example, properties to handle character widths,
           line folding behavior, ...), so the format of the returned value
           may change.  Use "header_encoding", "body_encoding" or
           "output_charset" to get a particular profile.
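
           As a small sketch of that advice, the code below reads the
           profile for one charset with recommended() and checks that the
           single-value accessors report the same header and body encodings.
           Only the read-only form is used, so no profile is changed.

```perl
use strict;
use warnings;
use MIME::Charset;

# Read-only use: no optional arguments, so nothing is changed.
my ($henc, $benc, $enccharset) = MIME::Charset::recommended("iso-8859-2");

# The accessors return one item each and keep working even if the
# list returned by recommended() grows in a later release.
my $cs      = MIME::Charset->new("iso-8859-2");
my $cs_henc = $cs->header_encoding;
my $cs_benc = $cs->body_encoding;

# Compare with undef treated as an empty string.
my $same_header = (defined $henc ? $henc : "") eq
                  (defined $cs_henc ? $cs_henc : "");
my $same_body   = (defined $benc ? $benc : "") eq
                  (defined $cs_benc ? $cs_benc : "");

print "header encodings agree\n" if $same_header;
print "body encodings agree\n"   if $same_body;
```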

   Constants
       USE_ENCODE
           Unicode/multibyte support flag.  It is a non-empty string when
           Unicode and multibyte support is enabled.  Currently, this flag
           is non-empty on Perl 5.7.3 or later and an empty string on
           earlier versions of Perl.

   Error Handling
       "body_encode" and "header_encode" accept the following "Replacement"
       options:

       "DEFAULT"
           Put a substitution character in place of a malformed character.
           For UCM-based encodings, <subchar> will be used.

       "FALLBACK"
           Try the "DEFAULT" scheme using the fallback charset (see
           "fallback").  When the fallback charset is undefined and
           conversion causes an error, the code will die with an error
           message.

       "CROAK"
           The code will die immediately on error with an error message.
           Therefore, you should trap the fatal error with eval{} unless you
           really want to let it die on error.  Synonym is "STRICT".

       "PERLQQ"
       "HTMLCREF"
       "XMLCREF"
           Use the "FB_PERLQQ", "FB_HTMLCREF" or "FB_XMLCREF" scheme defined
           by the Encode module.

       numeric values
           Numeric values are also allowed.  For more details see "Handling
           Malformed Data" in Encode.

       If the error handling scheme is not specified or an unknown scheme is
       specified, "DEFAULT" will be assumed.
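
       The eval{} advice for "CROAK" can be sketched as follows.  The input
       is a Unicode string containing a character (U+3042) that has no
       ISO-8859-1 mapping, which is assumed here to be enough to trigger a
       conversion error; both outcomes are handled so the sketch does not
       depend on that assumption.

```perl
use strict;
use warnings;
use MIME::Charset;

my $latin1 = MIME::Charset->new("iso-8859-1");
my $string = "caf\x{e9} \x{3042}";   # U+3042 is not in ISO-8859-1

my ($outcome, @result);
my $ok = eval {
    @result = $latin1->body_encode($string, Replacement => "CROAK");
    1;
};

if ($ok) {
    $outcome = "converted";
    print "converted; output charset is ",
        (defined $result[1] ? $result[1] : "undef"), "\n";
} else {
    $outcome = "failed";
    print "conversion failed and was trapped by eval\n";
}
```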

   Configuration File
       Built-in defaults for option parameters can be overridden by the
       configuration file MIME/Charset/Defaults.pm.  For more details read
       MIME/Charset/Defaults.pm.sample.

VERSION
       Consult the $VERSION variable.

       Development versions of this module may be found at
       <http://hatuka.nezumi.nu/repos/MIME-Charset/>.

   Incompatible Changes
       Release 1.001
           ·   The new() method returns an object when the CHARSET argument
               is not specified.

       Release 1.005
           ·   Restrict characters in encoded-word according to RFC 2047
               section 5 (3).  This also affects the return value of the
               encoded_header_len() method.

       Release 1.008.2
           ·   The body_encoding() method may also return "S".

           ·   The return value of the body_encode() method for UTF-8 may
               include a "QUOTED-PRINTABLE" encoding item, which in earlier
               versions was fixed to "BASE64".

SEE ALSO
       Multipurpose Internet Mail Extensions (MIME).

AUTHOR
       Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>

COPYRIGHT
       Copyright (C) 2006-2017 Hatuka*nezumi - IKEDA Soji.  This program is
       free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.

perl v5.32.0                      2020-07-28                  MIME::Charset(3)