Unicode::UTF8(3pm)

Unicode::UTF8(3)      User Contributed Perl Documentation     Unicode::UTF8(3)

NAME

       Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form

SYNOPSIS

           use Unicode::UTF8 qw[decode_utf8 encode_utf8];

           use warnings FATAL => 'utf8'; # fatalize encoding glitches
           $string = decode_utf8($octets);
           $octets = encode_utf8($string);

DESCRIPTION

       This module provides functions to encode and decode the UTF-8 encoding
       form as specified by Unicode and ISO/IEC 10646:2011.

FUNCTIONS

   decode_utf8
           $string = decode_utf8($octets);
           $string = decode_utf8($octets, $fallback);

       Returns a decoded representation of $octets in UTF-8 encoding as a
       character string.

       $fallback is an optional "CODE" reference that provides a custom
       error-handling mechanism. The default mechanism replaces any
       ill-formed UTF-8 sequence, or any encoded code point which can't be
       interchanged, with REPLACEMENT CHARACTER (U+FFFD).

           $string = $fallback->($octets, $is_usv, $position);

       $fallback is invoked with three arguments: $octets, $is_usv and
       $position. $octets is a sequence of one or more octets containing the
       maximal subpart of the ill-formed subsequence or encoded code point
       which can't be interchanged. $is_usv is a boolean indicating whether or
       not $octets represents an encoded Unicode scalar value. $position is an
       unsigned integer containing the zero-based octet position at which the
       error occurred within the octets provided to decode_utf8(). $fallback
       must return a character string consisting of zero or more Unicode
       scalar values. Unicode scalar values consist of code points in the
       ranges U+0000..U+D7FF and U+E000..U+10FFFF.

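       The fallback contract described above can be sketched as follows. The
       marker format and the @errors array are illustrative assumptions, not
       part of the module; only decode_utf8() and the three fallback
       arguments come from this documentation.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

# Hypothetical fallback (not part of the module): instead of U+FFFD, emit a
# visible marker holding the rejected octets in hexadecimal.
my @errors;
my $fallback = sub {
    my ($octets, $is_usv, $position) = @_;
    push @errors, $position;    # remember where each error occurred
    my $hex = join ' ', map { sprintf('%02X', ord $_) } split(//, $octets);
    return "<$hex>";            # must consist of Unicode scalar values only
};

# The octet 0xFF can never appear in well-formed UTF-8.
my $string = decode_utf8("abc\xFFdef", $fallback);
```

       Returning an empty string from the fallback would drop the offending
       octets instead of marking them.
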
   encode_utf8
           $octets = encode_utf8($string);
           $octets = encode_utf8($string, $fallback);

       Returns an encoded representation of $string in UTF-8 encoding as an
       octet string.

       $fallback is an optional "CODE" reference that provides a custom
       error-handling mechanism. The default mechanism replaces any code
       point which can't be interchanged or represented in UTF-8 encoding
       form with REPLACEMENT CHARACTER (U+FFFD).

           $string = $fallback->($codepoint, $is_usv, $position);

       $fallback is invoked with three arguments: $codepoint, $is_usv and
       $position. $codepoint is an unsigned integer containing the code point
       which can't be interchanged or represented in UTF-8 encoding form.
       $is_usv is a boolean indicating whether or not $codepoint is a Unicode
       scalar value. $position is an unsigned integer containing the
       zero-based character position at which the error occurred within the
       string provided to encode_utf8(). $fallback must return a character
       string consisting of zero or more Unicode scalar values. Unicode
       scalar values consist of code points in the ranges U+0000..U+D7FF and
       U+E000..U+10FFFF.

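       A minimal sketch of an encoding fallback, assuming a string that
       contains a surrogate code point. The "[U+XXXX]" marker and the @seen
       array are hypothetical choices made for illustration.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[encode_utf8];

# Hypothetical fallback (not part of the module): spell out the rejected
# code point as text and record what was seen.
my @seen;
my $fallback = sub {
    my ($codepoint, $is_usv, $position) = @_;
    push @seen, { codepoint => $codepoint, is_usv => $is_usv, position => $position };
    return sprintf('[U+%04X]', $codepoint);    # plain ASCII replacement
};

# U+D800 is a surrogate code point, so it is not a Unicode scalar value.
my $string = 'a' . chr(0xD800) . 'b';
my $octets = encode_utf8($string, $fallback);
```

       Here $position counts characters in $string, whereas the fallback of
       decode_utf8() receives an octet position.
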
   valid_utf8
           $boolean = valid_utf8($octets);

       Returns a boolean indicating whether or not the given $octets consist
       of well-formed UTF-8 sequences.

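       As an illustration, the sketch below checks three octet strings. The
       sample inputs are assumptions chosen for the example: one well-formed
       sequence, one Latin-1 octet left unconverted, and one overlong
       encoding.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[valid_utf8];

my $good     = "caf\xC3\xA9";   # U+00E9 encoded as two UTF-8 octets
my $latin1   = "caf\xE9";       # lone 0xE9 lead octet, sequence is truncated
my $overlong = "\xC0\x80";      # overlong form of U+0000, never well-formed

# Normalize each result to 1 or 0 for easy comparison.
my @results = map { valid_utf8($_) ? 1 : 0 } $good, $latin1, $overlong;
```

       Only the first input is expected to pass. valid_utf8() makes it
       possible to reject input up front rather than decode it with
       replacement characters.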

EXPORTS

       None by default. All functions can be exported using the ":all" tag or
       individually.

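       For example, importing everything with the ":all" tag and doing a
       round trip might look like this. The sample character U+263A is an
       arbitrary choice for the sketch.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[:all];    # imports decode_utf8, encode_utf8, valid_utf8

my $octets = encode_utf8("\x{263A}");   # character string to octets
my $ok     = valid_utf8($octets);       # the octets are well-formed UTF-8
my $string = decode_utf8($octets);      # and decode back to the same string
```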

DIAGNOSTICS

       Can't decode a wide character string
           (F) Wide character in octets.

       Can't validate a wide character string
           (F) Wide character in octets.

       Can't decode ill-formed UTF-8 octet sequence <%s> in position %u
           (W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s>
           contains a hexadecimal representation of the maximal subpart of the
           ill-formed subsequence.

       Can't interchange noncharacter code point U+%X in position %u
           (W utf8, nonchar) Noncharacters are code points that are
           permanently reserved in the Unicode Standard for internal use. They
           are forbidden for use in open interchange of Unicode text data.
           Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is
           from 0x0 to 0x10) and the values U+FDD0..U+FDEF.

       Can't represent surrogate code point U+%X in position %u
           (W utf8, surrogate) Surrogate code points are designated only for
           surrogate code units in the UTF-16 character encoding form.
           Surrogates consist of code points in the range U+D800 to U+DFFF.

       Can't represent super code point \x{%X} in position %u
           (W utf8, non_unicode) The code point is greater than U+10FFFF and
           exists only in Perl's extended codespace.

       Can't decode ill-formed UTF-X octet sequence <%s> in position %u
           (F) Encountered an ill-formed octet sequence in Perl's internal
           representation of wide characters.

       The sub-categories "nonchar", "surrogate" and "non_unicode" are only
       available on Perl 5.14 or greater. See perllexwarn for available
       categories and hierarchies.

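       The difference between (W) and (F) diagnostics can be sketched as
       follows. The variable names are hypothetical. The sketch assumes, as
       in the SYNOPSIS, that "use warnings FATAL => 'utf8'" in the calling
       scope turns the (W utf8) diagnostics into exceptions.

```perl
use strict;
use warnings;
use Unicode::UTF8 qw[decode_utf8];

my ($fatal_error, $wide_error);

{
    # Promote the (W utf8) diagnostics to exceptions in this scope only.
    use warnings FATAL => 'utf8';
    eval { decode_utf8("abc\xFF"); 1 } or $fatal_error = $@;
}

# (F) diagnostics are fatal errors regardless of the warning settings:
# a string holding a character above 0xFF is not an octet string.
eval { decode_utf8("\x{263A}"); 1 } or $wide_error = $@;
```

       Both variables then hold the corresponding diagnostic text, which can
       be logged or rethrown by the caller.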

COMPARISON

       Here is a summary of features for comparison with Encode's UTF-8
       implementation:

       •   Simple API which makes use of Perl's standard warning categories.

       •   Recognizes all noncharacters regardless of Perl version

       •   Implements Unicode's recommended practice for using U+FFFD.

       •   Better diagnostics in warning messages

       •   Detects and reports inconsistency in Perl's internal representation
           of wide characters (UTF-X)

       •   Preserves taintedness of decoded $octets or encoded $string

       •   Better performance ~ 600% - 1200% (JA: 600%, AR: 700%, SV: 900%,
           EN: 1200%, see benchmarks directory in git repository)

CONFORMANCE

       It's the author's belief that this UTF-8 implementation is conformant
       with the Unicode Standard Version 6.0. Any deviation from the Unicode
       Standard is to be considered a bug.

SUPPORT

   BUGS
       Please report any bugs by email to "bug-unicode-utf8 at rt.cpan.org",
       or through the web interface at
       <http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8>.  You
       will be automatically notified of any progress on the request by the
       system.

   SOURCE CODE
       This is open source software. The code repository is available for
       public review and contribution under the terms of the license.

       <http://github.com/chansen/p5-unicode-utf8>

           git clone http://github.com/chansen/p5-unicode-utf8

AUTHOR

       Christian Hansen "chansen@cpan.org"

COPYRIGHT

       Copyright 2011-2017 by Christian Hansen.

       This is free software; you can redistribute it and/or modify it under
       the same terms as the Perl 5 programming language system itself.

perl v5.38.0                      2023-07-21                  Unicode::UTF8(3)