Map8(3)           User Contributed Perl Documentation          Map8(3)

NAME
Unicode::Map8 - Mapping table between 8-bit chars and Unicode

SYNOPSIS
    require Unicode::Map8;
    my $no_map = Unicode::Map8->new("ISO646-NO") || die;
    my $l1_map = Unicode::Map8->new("latin1")    || die;

    # Convert an ISO646-NO string to 16-bit Unicode, then to Latin-1
    my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
    my $lstr = $l1_map->to8($ustr);
    print $lstr;

    print $no_map->tou("V}re norske tegn b|r {res\n")->utf8;

DESCRIPTION
The Unicode::Map8 class implements efficient mapping tables between
8-bit character sets and 16-bit character sets like Unicode. The
tables are efficient both in terms of space allocated and translation
speed. The 16-bit strings are assumed to use network byte order.
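
As an illustration of what "network byte order" means for the 16-bit
strings, the following sketch builds and decodes such a string with
Perl's built-in pack() and unpack(). The helper names codes_to_str16
and str16_to_codes are hypothetical and are not part of Unicode::Map8.

```perl
use strict;
use warnings;

# Encode a list of 16-bit code values as a byte string in network
# (big-endian) order; "n" is pack's unsigned 16-bit big-endian format.
sub codes_to_str16 { return pack("n*", @_) }

# Decode such a string back into a list of code values.
sub str16_to_codes { return unpack("n*", $_[0]) }

my $str16 = codes_to_str16(0x0056, 0x00E5);    # "V" followed by U+00E5
my @codes = str16_to_codes($str16);

# Show the raw bytes: the high byte of each code comes first.
printf "%d bytes: %s\n", length($str16),
    join(" ", map { sprintf("%02X", $_) } unpack("C*", $str16));
# prints "4 bytes: 00 56 00 E5"
```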

The following methods are available:

$m = Unicode::Map8->new( [$charset] )
    The object constructor creates new instances of the Unicode::Map8
    class. It takes an optional argument that specifies the name of an
    8-bit character set to initialize mappings from. The argument can
    also be the name of a mapping file. If the charset/file cannot
    be located, then the constructor returns undef.

    If you omit the argument, then an empty mapping table is
    constructed. You must then add mapping pairs to it using the
    addpair() method described below.

$m->addpair( $u8, $u16 );
    Adds a new mapping pair to the mapping object. It takes two
    arguments. The first is the code value in the 8-bit character set
    and the second is the corresponding code value in the 16-bit
    character set. The same codes can be used multiple times (but
    repeating the same pair has no effect). The first definition for
    a code is the one that is used.

    Consider the following example:

        $m->addpair(0x20, 0x0020);
        $m->addpair(0x20, 0x00A0);
        $m->addpair(0xA0, 0x00A0);

    It means that the characters 0x20 and 0xA0 in the 8-bit charset
    map to themselves in the 16-bit set, but in the 16-bit character
    set 0x00A0 maps to 0x20.
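
The "first definition wins" rule can be modelled in plain Perl with one
hash per direction. This is only a sketch of the behaviour described
above, not the module's actual implementation; model_addpair, %to16
and %to8 are hypothetical names.

```perl
use strict;
use warnings;

my (%to16, %to8);    # hypothetical stand-ins for the two lookup directions

# Record a pair in each direction, unless that direction already has
# an entry for the code (the first definition is the one that is kept).
sub model_addpair {
    my ($u8, $u16) = @_;
    $to16{$u8} = $u16 unless exists $to16{$u8};
    $to8{$u16} = $u8  unless exists $to8{$u16};
    return;
}

model_addpair(0x20, 0x0020);
model_addpair(0x20, 0x00A0);    # forward entry for 0x20 exists; only 0x00A0 -> 0x20 is new
model_addpair(0xA0, 0x00A0);    # reverse entry for 0x00A0 exists; only 0xA0 -> 0x00A0 is new

printf "8-bit  0x%02X -> 0x%04X\n", $_, $to16{$_} for sort { $a <=> $b } keys %to16;
printf "16-bit 0x%04X -> 0x%02X\n", $_, $to8{$_}  for sort { $a <=> $b } keys %to8;
```

In this model 0x20 and 0xA0 map to themselves going to 16 bits, while
both 0x0020 and 0x00A0 map back to 0x20, matching the example above.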

$m->default_to8( $u8 )
    Set the code of the default character to use when mapping from
    16-bit to 8-bit strings. If there is no mapping pair defined for a
    character then this default is substituted by to8() and recode8().

$m->default_to16( $u16 )
    Set the code of the default character to use when mapping from
    8-bit to 16-bit strings. If there is no mapping pair defined for a
    character then this default is used by to16(), tou() and recode8().

$m->nostrict;
    All undefined mappings are replaced with the identity mapping.
    Undefined characters are normally just removed (or replaced with
    the default if defined) when converting between character sets.

$m->to8( $ustr );
    Converts a 16-bit character string to the corresponding string in
    the 8-bit character set.

$m->to16( $str );
    Converts an 8-bit character string to the corresponding string in
    the 16-bit character set.

$m->tou( $str );
    Same as to16() but returns a Unicode::String object instead of a
    plain UCS-2 string.

$m->recode8($m2, $str);
    Map the string $str from one 8-bit character set ($m) to another
    one ($m2). Since we assume we know the mappings towards the common
    16-bit encoding we can use this to convert between any of the 8-bit
    character sets.
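
The idea behind recode8() can be sketched in plain Perl as two table
lookups through the common 16-bit code. The tables below are small
hypothetical excerpts (ISO646-NO with its three lower-case national
letters, Latin-1 as the identity), model_recode8 is a made-up name, and
dropping unmapped characters is a simplification; the module itself may
behave differently in detail.

```perl
use strict;
use warnings;

# 8-bit code => 16-bit code for a simplified ISO646-NO: ASCII, except
# that { | } stand for the letters U+00E6, U+00F8 and U+00E5.
my %no_to16 = map { ($_, $_) } 0x00 .. 0x7F;
@no_to16{ 0x7B, 0x7C, 0x7D } = (0x00E6, 0x00F8, 0x00E5);

# 16-bit code => 8-bit code for Latin-1 (the identity on 0x00..0xFF).
my %l1_to8 = map { ($_, $_) } 0x00 .. 0xFF;

# Map each byte to its 16-bit code, then to the destination 8-bit code.
sub model_recode8 {
    my ($from_to16, $dest_to8, $str) = @_;
    my $out = '';
    for my $u8 (unpack("C*", $str)) {
        my $u16 = $from_to16->{$u8};
        # Unmapped characters are simply dropped in this model.
        next unless defined $u16 && exists $dest_to8->{$u16};
        $out .= chr($dest_to8->{$u16});
    }
    return $out;
}

my $latin1 = model_recode8(\%no_to16, \%l1_to8, "V}re norske tegn b|r {res\n");
print $latin1;    # Latin-1 bytes; how they display depends on the terminal
```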

$m->to_char16( $u8 )
    Maps a single 8-bit character code to a 16-bit code. If the 8-bit
    character is unmapped then the constant NOCHAR is returned. The
    default is not used and the callback method is not invoked.

$m->to_char8( $u16 )
    Maps a single 16-bit character code to an 8-bit code. If the
    16-bit character is unmapped then the constant NOCHAR is returned.
    The default is not used and the callback method is not invoked.

The following callback methods are available. You can override these
methods by creating a subclass of Unicode::Map8.

$m->unmapped_to8
    When mapping to an 8-bit character string and there is no mapping
    defined (and no default either), then this method is called as the
    last resort. It is called with a single integer argument which is
    the code of the unmapped 16-bit character. It is expected to
    return a string that will be incorporated in the 8-bit string. The
    default version of this method always returns an empty string.

    Example:

        package MyMapper;
        require Unicode::Map8;
        our @ISA = qw(Unicode::Map8);

        # Render each unmapped character as its Unicode name in <...>
        sub unmapped_to8
        {
            my($self, $code) = @_;
            require Unicode::CharName;
            return "<" . Unicode::CharName::uname($code) . ">";
        }

$m->unmapped_to16
    Likewise, when mapping to a 16-bit character string and no mapping
    is defined, then this method is called. It should return a 16-bit
    string with the bytes in network byte order. The default version
    of this method always returns an empty string.

FILES
The Unicode::Map8 constructor can parse two different file formats: a
binary format and a textual format.

The binary format is simple. It consists of a sequence of 16-bit
integer pairs in network byte order. The first pair should contain the
magic value 0xFFFE, 0x0001. In each following pair, the first value is
the code of an 8-bit character and the second is the code of the 16-bit
character. It follows from this that the first value should be less
than 256.
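
The binary layout described above can be written and read with pack()
and unpack() using the "n" (16-bit, network order) format. This is a
minimal sketch based only on that description; pairs_to_bin and
bin_to_pairs are hypothetical helpers, and the checks they perform are
assumptions rather than the module's own validation.

```perl
use strict;
use warnings;

# Serialize ([8-bit code, 16-bit code], ...) with the magic pair first.
sub pairs_to_bin {
    my @pairs = @_;
    return pack("n*", 0xFFFE, 0x0001, map { @$_ } @pairs);
}

# Parse the binary form back into a list of [8-bit code, 16-bit code].
sub bin_to_pairs {
    my ($bin) = @_;
    my @vals = unpack("n*", $bin);
    die "bad magic\n"
        unless @vals >= 2 && $vals[0] == 0xFFFE && $vals[1] == 0x0001;
    splice(@vals, 0, 2);                      # drop the magic pair
    die "odd number of values\n" if @vals % 2;

    my @pairs;
    while (my ($u8, $u16) = splice(@vals, 0, 2)) {
        die sprintf("8-bit code 0x%X out of range\n", $u8) if $u8 > 0xFF;
        push @pairs, [$u8, $u16];
    }
    return @pairs;
}

my $bin   = pairs_to_bin([0x20, 0x0020], [0xA0, 0x00A0]);
my @pairs = bin_to_pairs($bin);
printf "0x%02X 0x%04X\n", @$_ for @pairs;
```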

The textual format consists of lines, each of which is either a comment
(first non-blank character is '#'), a completely blank line, or a line
with two hexadecimal numbers. The hexadecimal numbers must be preceded
by "0x" as in C and Perl. This is the same format used by the Unicode
mapping files available from <URL:ftp://ftp.unicode.org/Public>.
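
A reader for the textual format might look like the sketch below. The
function name parse_map_text and the sample data are hypothetical, and
ignoring anything that follows the second number (such as a trailing
comment) is an assumption, not documented behaviour.

```perl
use strict;
use warnings;

# Return a list of [8-bit code, 16-bit code] pairs found in $text.
sub parse_map_text {
    my ($text) = @_;
    my @pairs;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;       # comment or blank line
        if ($line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/) {
            push @pairs, [ hex($1), hex($2) ];
        }
        # Any other line is silently skipped in this sketch.
    }
    return @pairs;
}

my $sample = <<'END';
# hypothetical excerpt of a mapping file
0x20    0x0020  # SPACE

0xA0    0x00A0  # NO-BREAK SPACE
END

my @pairs = parse_map_text($sample);
printf "0x%02X 0x%04X\n", @$_ for @pairs;
```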

The mapping table files are installed in the Unicode/Map8/maps
directory somewhere in the Perl @INC path. The variable
$Unicode::Map8::MAPS_DIR is the complete path name to this directory.
Binary mapping files are stored within this directory with the suffix
.bin. Textual mapping files are stored with the suffix .txt.

The scripts map8_bin2txt and map8_txt2bin can translate between these
mapping file formats.

A special file called aliases within $MAPS_DIR specifies all the alias
names that can be used to denote the various character sets. The first
name on each line is the real file name and the rest are alias names
separated by spaces.
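
Resolving an alias to its real file name can be sketched as follows.
The function name parse_aliases and the sample line are hypothetical;
the contents of the installed aliases file are not reproduced here, and
exact, case-sensitive matching is an assumption.

```perl
use strict;
use warnings;

# Build a lookup table: every name on a line (the real file name and
# each alias) maps to the real file name, which is the first word.
sub parse_aliases {
    my ($text) = @_;
    my %real_name_for;
    for my $line (split /\n/, $text) {
        my ($real, @aliases) = split ' ', $line;
        next unless defined $real;            # skip empty lines
        $real_name_for{$_} = $real for $real, @aliases;
    }
    return %real_name_for;
}

# Hypothetical sample line: real file name first, then its aliases.
my %real_name_for = parse_aliases("8859-1 ISO-8859-1 latin1 l1\n");
print "latin1 is stored as $real_name_for{latin1}\n";
# prints "latin1 is stored as 8859-1"
```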

The "umap --list" command can be used to list the supported character
sets.

BUGS
Does not handle Unicode surrogate pairs as a single character.

SEE ALSO
umap(1), Unicode::String

COPYRIGHT
Copyright 1998 Gisle Aas.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

perl v5.8.8                      2002-12-27                     Map8(3)