MIME::Charset(3)      User Contributed Perl Documentation      MIME::Charset(3)

NAME
    MIME::Charset - Charset Information for MIME

SYNOPSIS
        use MIME::Charset;

        $charset = MIME::Charset->new("euc-jp");

    Getting charset information:

        $benc = $charset->body_encoding;    # e.g. "Q"
        $cset = $charset->as_string;        # e.g. "US-ASCII"
        $henc = $charset->header_encoding;  # e.g. "S"
        $cset = $charset->output_charset;   # e.g. "ISO-2022-JP"

    Translating text data:

        ($text, $cset, $encoding) =
            $charset->header_encode(
                "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
                "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
                Charset => 'euc-jp');
        # ...returns e.g. (<converted>, "ISO-2022-JP", "B").

        ($text, $cset, $encoding) =
            $charset->body_encode(
                "Collectioneur path\xe9tiquement ".
                "\xe9clectique de d\xe9chets",
                Charset => 'latin1');
        # ...returns e.g. (<original>, "ISO-8859-1", "QUOTED-PRINTABLE").

        $len = $charset->encoded_header_len(
            "Perl\xe8\xa8\x80\xe8\xaa\x9e",
            Charset => 'utf-8',
            Encoding => "b");
        # ...returns e.g. 28.

    Manipulating module defaults:

        MIME::Charset::alias("csEUCKR", "euc-kr");
        MIME::Charset::default("iso-8859-1");
        MIME::Charset::fallback("us-ascii");

    Non-OO functions (may be deprecated in the near future):

        use MIME::Charset qw(:info);

        $benc = body_encoding("iso-8859-2");          # "Q"
        $cset = canonical_charset("ANSI X3.4-1968");  # "US-ASCII"
        $henc = header_encoding("utf-8");             # "S"
        $cset = output_charset("shift_jis");          # "ISO-2022-JP"

        use MIME::Charset qw(:trans);

        ($text, $charset, $encoding) =
            header_encode(
                "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa".
                "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef",
                "euc-jp");
        # ...returns (<converted>, "ISO-2022-JP", "B");

        ($text, $charset, $encoding) =
            body_encode(
                "Collectioneur path\xe9tiquement ".
                "\xe9clectique de d\xe9chets",
                "latin1");
        # ...returns (<original>, "ISO-8859-1", "QUOTED-PRINTABLE");

        $len = encoded_header_len(
            "Perl\xe8\xa8\x80\xe8\xaa\x9e", "b", "utf-8"); # 28

DESCRIPTION
    MIME::Charset provides information about character sets used for MIME
    messages on the Internet.

  Definitions
    The charset is the ``character set'' used in MIME to refer to a method
    of converting a sequence of octets into a sequence of characters.  It
    covers both the ``coded character set'' (CCS) and the ``character
    encoding scheme'' (CES) concepts of ISO/IEC.

    The encoding is the term used in MIME for a method of representing a
    body part or a header body as sequence(s) of printable US-ASCII
    characters.

  Constructor
    $charset = MIME::Charset->new([CHARSET [, OPTS]])
        Create a charset object.

        OPTS may contain the following key-value pair.  NOTE: When
        Unicode/multibyte support is disabled (see "USE_ENCODE"),
        conversion is not performed, so this option has no effect.

        Mapping => MAPTYPE
            Whether to extend the mappings actually used for charset
            names.  "EXTENDED" uses extended mappings.  "STANDARD" uses
            standardized strict mappings.  Default is "EXTENDED".
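
    As a rough illustration of the Mapping option, the sketch below builds
    two objects for the same charset name, one with each documented
    MAPTYPE, and prints which decoder each ends up with.  The choice of
    "shift_jis" is only an assumption made for the example; which decoder
    names appear depends on the installed Encode, so no output is shown.

```perl
use strict;
use warnings;
use MIME::Charset;

# Same charset name, resolved with the two documented MAPTYPE values.
my $extended = MIME::Charset->new("shift_jis");    # Mapping defaults to "EXTENDED"
my $standard = MIME::Charset->new("shift_jis", Mapping => "STANDARD");

for my $obj ($extended, $standard) {
    # decoder returns undef when the charset is unknown to the module.
    my $decoder = $obj->decoder;
    printf "%s decodes via %s\n",
        $obj->as_string,
        defined $decoder ? $decoder->name : "(no decoder)";
}
```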

  Getting Information of Charsets
    $charset->body_encoding
    body_encoding CHARSET
        Get the recommended transfer-encoding of CHARSET for the message
        body.

        The returned value is one of "B" (BASE64), "Q" (QUOTED-PRINTABLE),
        "S" (the shorter of the two) or "undef" (might not be
        transfer-encoded; either 7BIT or 8BIT).  This may differ from the
        encoding for the message header.

    $charset->as_string
    canonical_charset CHARSET
        Get the canonical name of the charset.

    $charset->decoder
        Get the "Encode::Encoding" object that decodes strings in this
        charset to Unicode.  If the charset is not specified or is not
        known to this module, undef is returned.

    $charset->dup
        Get a copy of the charset object.

    $charset->encoder([CHARSET])
        Get the "Encode::Encoding" object that encodes a Unicode string
        using the compatible charset recommended for messages on the
        Internet.

        If the optional CHARSET is specified, the encoder (and output
        charset name) of the $charset object is replaced with those of
        CHARSET, so that the $charset object becomes a converter between
        the original charset and the new CHARSET.

    $charset->header_encoding
    header_encoding CHARSET
        Get the recommended encoding scheme of CHARSET for the message
        header.

        The returned value is one of "B", "Q", "S" (the shorter of the
        two) or "undef" (might not be encoded).  This may differ from the
        encoding for the message body.

    $charset->output_charset
    output_charset CHARSET
        Get a charset that is compatible with the given CHARSET and is
        recommended for MIME messages on the Internet (if this module
        knows one).

        When Unicode/multibyte support is disabled (see "USE_ENCODE"),
        this function simply returns the result of "canonical_charset".
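
    The getters above can be exercised together as in the following
    sketch.  The values in the comments are the ones quoted in the
    SYNOPSIS of this page; they are not guaranteed for every installation.

```perl
use strict;
use warnings;
use MIME::Charset;

my $latin2 = MIME::Charset->new("iso-8859-2");
my $ascii  = MIME::Charset->new("ANSI X3.4-1968");
my $utf8   = MIME::Charset->new("utf-8");

my $benc = $latin2->body_encoding;    # "Q" according to the SYNOPSIS
my $name = $ascii->as_string;         # "US-ASCII" according to the SYNOPSIS
my $henc = $utf8->header_encoding;    # "S" according to the SYNOPSIS

# undef is a legitimate answer for either encoding, so guard before printing.
printf "body=%s canonical=%s header=%s\n",
    map { defined $_ ? $_ : "undef" } $benc, $name, $henc;

# dup returns a copy of the charset object.
my $copy = $utf8->dup;
print "copy is ", $copy->as_string, "\n";
```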

  Translating Text Data
    $charset->body_encode(STRING [, OPTS])
    body_encode STRING, CHARSET [, OPTS]
        Get the converted (if needed) data of STRING and the recommended
        transfer-encoding of that data for the message body.  CHARSET is
        the charset in which STRING is encoded.

        OPTS may contain the following key-value pairs.  NOTE: When
        Unicode/multibyte support is disabled (see "USE_ENCODE"),
        conversion is not performed, so these options have no effect.

        Detect7bit => YESNO
            Try to auto-detect a 7-bit charset when CHARSET is not given.
            Default is "YES".

        Replacement => REPLACEMENT
            Specifies the error handling scheme.  See "Error Handling".

        A 3-item list of (converted string, charset for output,
        transfer-encoding) is returned.  The transfer-encoding is one of
        "BASE64", "QUOTED-PRINTABLE", "7BIT" or "8BIT".  If the charset
        for output could not be determined and the converted string
        contains non-ASCII byte(s), the charset for output is "undef" and
        the transfer-encoding is "BASE64".  The charset for output is
        "US-ASCII" if and only if the string does not contain any
        non-ASCII bytes.
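
    Since body_encode returns the recommended transfer-encoding rather
    than applying it, a caller still has to encode the body.  The sketch
    below does so with the core MIME::Base64 and MIME::QuotedPrint modules;
    how the result is printed is an assumption made for the example.

```perl
use strict;
use warnings;
use MIME::Charset;
use MIME::Base64 qw(encode_base64);
use MIME::QuotedPrint qw(encode_qp);

my $charset = MIME::Charset->new("latin1");
my ($body, $cset, $cte) = $charset->body_encode(
    "Collectioneur path\xe9tiquement \xe9clectique de d\xe9chets");

# Apply whichever transfer-encoding was recommended.
my $encoded =
      $cte eq "BASE64"           ? encode_base64($body)
    : $cte eq "QUOTED-PRINTABLE" ? encode_qp($body)
    :                              $body;    # 7BIT or 8BIT: send as is

# The output charset may be undef, so add the parameter only when known.
my $ctype = "text/plain";
$ctype .= "; charset=$cset" if defined $cset;

print "Content-Type: $ctype\n";
print "Content-Transfer-Encoding: $cte\n\n";
print $encoded, "\n";
```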

    $charset->decode(STRING [,CHECK])
        Decode STRING to Unicode.

        Note: When Unicode/multibyte support is disabled (see
        "USE_ENCODE"), this function will die.

    detect_7bit_charset STRING
        Guess the 7-bit charset that may encode STRING.  If STRING
        contains any 8-bit bytes, "undef" is returned.  Otherwise, if no
        known charset matches, the default charset is returned.

    $charset->encode(STRING [, CHECK])
        Encode STRING (Unicode or non-Unicode) using the compatible
        charset recommended for messages on the Internet (if this module
        knows one).  Note that the string is decoded to Unicode and then
        encoded even if the compatible charset is the same as the
        original charset.

        Note: When Unicode/multibyte support is disabled (see
        "USE_ENCODE"), this function will die.

    $charset->encoded_header_len(STRING [, ENCODING])
    encoded_header_len STRING, ENCODING, CHARSET
        Get the length of the encoded STRING for a message header
        (without folding).

        ENCODING may be one of "B", "Q" or "S" (the shorter of "B" and
        "Q").
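
    To make the three ENCODING values concrete, the sketch below measures
    the SYNOPSIS string under each of them using the functional form.  It
    assumes the ":trans" export tag shown in the SYNOPSIS.

```perl
use strict;
use warnings;
use MIME::Charset qw(:trans);

my $octets = "Perl\xe8\xa8\x80\xe8\xaa\x9e";    # 10 UTF-8 octets, as in the SYNOPSIS

my %len = map { $_ => encoded_header_len($octets, $_, "utf-8") } qw(B Q S);
printf "%s: %d\n", $_, $len{$_} for qw(B Q S);

# Worked arithmetic for "B": 10 octets become 16 BASE64 characters;
# adding "=?UTF-8?B?" (10) and "?=" (2) gives the 28 quoted in the SYNOPSIS.
```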

    $charset->header_encode(STRING [, OPTS])
    header_encode STRING, CHARSET [, OPTS]
        Get the converted (if needed) data of STRING and the recommended
        encoding scheme of that data for message headers.  CHARSET is the
        charset in which STRING is encoded.

        OPTS may contain the following key-value pairs.  NOTE: When
        Unicode/multibyte support is disabled (see "USE_ENCODE"),
        conversion is not performed, so these options have no effect.

        Detect7bit => YESNO
            Try to auto-detect a 7-bit charset when CHARSET is not given.
            Default is "YES".

        Replacement => REPLACEMENT
            Specifies the error handling scheme.  See "Error Handling".

        A 3-item list of (converted string, charset for output, encoding
        scheme) is returned.  The encoding scheme is one of "B", "Q" or
        "undef" (might not be encoded).  If the charset for output could
        not be determined and the converted string contains non-ASCII
        byte(s), the charset for output is "8BIT" (this is not a charset
        name but a special value that represents unencodable data) and
        the encoding scheme is "undef" (should not be encoded).  The
        charset for output is "US-ASCII" if and only if the string does
        not contain any non-ASCII bytes.
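
    header_encode returns the pieces of an encoded header rather than the
    finished header.  One way to assemble them is sketched below;
    to_encoded_word is a hypothetical helper written for this example and
    is not part of MIME::Charset.  It makes no attempt to split long text
    into several encoded-words as RFC 2047 requires.

```perl
use strict;
use warnings;
use MIME::Charset;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: wrap the triple returned by header_encode in
# RFC 2047 encoded-word syntax.
sub to_encoded_word {
    my ($text, $cset, $enc) = @_;
    return $text unless defined $enc;    # undef: leave the text unencoded
    if ($enc eq "B") {
        return "=?$cset?B?" . encode_base64($text, "") . "?=";
    }
    # "Q": hex-escape every octet outside a conservative safe set.
    (my $q = $text) =~ s/([^A-Za-z0-9!*+\-\/])/sprintf("=%02X", ord $1)/ge;
    return "=?$cset?Q?$q?=";
}

my $charset = MIME::Charset->new("euc-jp");
my @triple  = $charset->header_encode(
    "\xc9\xc2\xc5\xaa\xc0\xde\xc3\xef\xc5\xaa" .
    "\xc7\xd1\xca\xaa\xbd\xd0\xce\xcf\xb4\xef");

print "Subject: ", to_encoded_word(@triple), "\n";
```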

    $charset->undecode(STRING [,CHECK])
        Encode the Unicode string STRING to a byte string using the input
        charset of $charset.  This is equivalent to
        "$charset->decoder->encode()".

        Note: When Unicode/multibyte support is disabled (see
        "USE_ENCODE"), this function will die.
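
    Because decode, encode and undecode die when Unicode/multibyte support
    is disabled, callers may want to check "USE_ENCODE" first.  A minimal
    round-trip sketch, assuming the constant can be called as
    MIME::Charset::USE_ENCODE():

```perl
use strict;
use warnings;
use MIME::Charset;

my $charset = MIME::Charset->new("utf-8");
my $octets  = "Perl\xe8\xa8\x80\xe8\xaa\x9e";    # 10 UTF-8 octets

my ($unicode, $back);
if (MIME::Charset::USE_ENCODE()) {
    $unicode = $charset->decode($octets);      # octets -> character string
    $back    = $charset->undecode($unicode);   # characters -> octets in the input charset
    printf "%d octets -> %d characters -> %d octets\n",
        length $octets, length $unicode, length $back;
}
else {
    warn "Unicode/multibyte support is disabled; decode would die\n";
}
```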

  Manipulating Module Defaults
    alias ALIAS [, CHARSET]
        Get/set a charset alias for the canonical names determined by
        "canonical_charset".

        If CHARSET is given and is not false, ALIAS is assigned as an
        alias of CHARSET.  Otherwise, the alias is not changed.  In both
        cases, the charset name that ALIAS is currently assigned to is
        returned.

    default [CHARSET]
        Get/set the default charset.

        The default charset is used by this module when the charset
        context is unknown.  Modules using this module are recommended to
        use this charset when the charset context is unknown or an
        implicit default is expected.  By default, it is "US-ASCII".

        If CHARSET is given and is not false, it becomes the default
        charset.  Otherwise, the default charset is not changed.  In both
        cases, the current default charset is returned.

        NOTE: The default charset should not be changed.

    fallback [CHARSET]
        Get/set the fallback charset.

        The fallback charset is used by this module when conversion by
        the given charset fails and the "FALLBACK" error handling scheme
        is specified.  Modules using this module may use this charset as
        a last resort for conversion.  By default, it is "UTF-8".

        If CHARSET is given and is not false, it becomes the fallback
        charset.  If CHARSET is "NONE", the fallback charset becomes
        undefined.  Otherwise, the fallback charset is not changed.  In
        all cases, the current fallback charset is returned.

        NOTE: Specifying "US-ASCII" as the fallback charset is useful,
        since the result of conversion will be readable without charset
        information.
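
    The get/set behaviour described above can be sketched as follows.
    Saving and restoring the previous fallback is a choice made for this
    example, not something the module requires.

```perl
use strict;
use warnings;
use MIME::Charset;

# Getter form: no argument, nothing changes.
my $old_fallback = MIME::Charset::fallback();

# Setter form: returns the value now in effect.
my $now = MIME::Charset::fallback("us-ascii");
print "fallback: ", (defined $now ? $now : "undef"), "\n";

# Register an alias, then look it up again without changing it.
MIME::Charset::alias("csEUCKR", "euc-kr");
my $target = MIME::Charset::alias("csEUCKR");
print "csEUCKR -> $target\n";

# Restore the previous fallback; "NONE" is the documented way to clear it.
MIME::Charset::fallback(defined $old_fallback ? $old_fallback : "NONE");
```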

    recommended CHARSET [, HEADERENC, BODYENC [, ENCCHARSET]]
        Get/set charset profiles.

        If optional arguments are given and any of them is not false,
        the profiles for CHARSET are set from those arguments.
        Otherwise, the profiles are not changed.  In both cases, the
        current profiles for CHARSET are returned as a 3-item list of
        (HEADERENC, BODYENC, ENCCHARSET).

        HEADERENC is the recommended encoding scheme for the message
        header.  It may be one of "B", "Q", "S" (the shorter of the two)
        or "undef" (might not be encoded).

        BODYENC is the recommended transfer-encoding for the message
        body.  It may be one of "B", "Q", "S" (the shorter of the two) or
        "undef" (might not be transfer-encoded).

        ENCCHARSET is a charset that is compatible with the given CHARSET
        and is recommended for MIME messages on the Internet.  If
        conversion is not needed (or this module does not know an
        appropriate charset), ENCCHARSET is "undef".

        NOTE: In future releases this function may accept more optional
        arguments (for example, properties to handle character widths,
        line folding behavior, ...), so the format of the returned value
        may change.  Use "header_encoding", "body_encoding" or
        "output_charset" to get a particular profile.

  Constants
    USE_ENCODE
        Unicode/multibyte support flag.  It is set to a non-empty string
        when Unicode and multibyte support is enabled.  Currently, this
        flag is non-empty on Perl 5.7.3 or later and an empty string on
        earlier versions of Perl.

  Error Handling
    "body_encode" and "header_encode" accept the following "Replacement"
    options:

    "DEFAULT"
        Put a substitution character in place of a malformed character.
        For UCM-based encodings, <subchar> is used.

    "FALLBACK"
        Try the "DEFAULT" scheme using the fallback charset (see
        "fallback").  When the fallback charset is undefined and
        conversion causes an error, the code dies with an error message.

    "CROAK"
        The code dies immediately on error with an error message.  You
        should therefore trap the fatal error with eval{} unless you
        really want it to die.  "STRICT" is a synonym.

    "PERLQQ"
    "HTMLCREF"
    "XMLCREF"
        Use the "FB_PERLQQ", "FB_HTMLCREF" or "FB_XMLCREF" scheme defined
        by the Encode module.

    numeric values
        Numeric values are also allowed.  For more details see "Handling
        Malformed Data" in Encode.

    If the error handling scheme is not specified or an unknown scheme is
    specified, "DEFAULT" is assumed.
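
    The advice to trap "CROAK" errors with eval{} might look like the
    sketch below.  The input is an assumption chosen because a lone 0xE9
    octet is not valid UTF-8; whether it actually triggers an error
    depends on the installed Encode, so both outcomes are handled and no
    output is shown.

```perl
use strict;
use warnings;
use MIME::Charset;

my $charset = MIME::Charset->new("utf-8");
my $suspect = "caf\xe9";    # not well-formed UTF-8

# Run the conversion inside eval so that a CROAK does not end the program.
my @result = eval {
    $charset->body_encode($suspect, Replacement => "CROAK");
};
my $err = $@;

if ($err) {
    print "conversion failed: $err";
}
else {
    print "transfer-encoding: $result[2]\n";
}
```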

  Configuration File
    Built-in defaults for option parameters can be overridden by the
    configuration file MIME/Charset/Defaults.pm.  For more details, read
    MIME/Charset/Defaults.pm.sample.

VERSION
    Consult the $VERSION variable.

    Development versions of this module may be found at
    <http://hatuka.nezumi.nu/repos/MIME-Charset/>.

  Incompatible Changes
    Release 1.001
        •   The new() method returns an object when the CHARSET argument
            is not specified.

    Release 1.005
        •   Restrict characters in encoded-word according to RFC 2047
            section 5 (3).  This also affects the return value of the
            encoded_header_len() method.

    Release 1.008.2
        •   The body_encoding() method may also return "S".

        •   The return value of the body_encode() method for UTF-8 may
            include the "QUOTED-PRINTABLE" encoding item, which in
            earlier versions was fixed to "BASE64".

SEE ALSO
    Multipurpose Internet Mail Extensions (MIME).

AUTHOR
    Hatuka*nezumi - IKEDA Soji <hatuka(at)nezumi.nu>

COPYRIGHT
    Copyright (C) 2006-2017 Hatuka*nezumi - IKEDA Soji.  This program is
    free software; you can redistribute it and/or modify it under the same
    terms as Perl itself.

perl v5.36.0                      2023-01-20                  MIME::Charset(3)