Text::Unidecode(3)    User Contributed Perl Documentation    Text::Unidecode(3)

NAME
Text::Unidecode -- US-ASCII transliterations of Unicode text

SYNOPSIS
    use utf8;
    use Text::Unidecode;
    print unidecode(
        "\x{5317}\x{4EB0}\n"
        # those are the Chinese characters for Beijing
    );

    # That prints: Bei Jing

DESCRIPTION
It often happens that you have non-Roman text data in Unicode, but you
can't display it -- usually because you're trying to show it to a user
via an application that doesn't support Unicode, or because the fonts
you need aren't accessible. You could represent the Unicode characters
as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
user who actually wants to read what the text says.

What Text::Unidecode provides is a function, "unidecode(...)", that
takes Unicode data and tries to represent it in US-ASCII characters
(i.e., the universally displayable characters between 0x00 and 0x7F).
The representation is almost always an attempt at transliteration --
i.e., conveying, in Roman letters, the pronunciation expressed by the
text in some other writing system. (See the example in the synopsis.)

Unidecode's ability to transliterate is limited by two factors:

* The amount and quality of data in the original
    So if you have Hebrew data that has no vowel points in it, then
    Unidecode cannot guess what vowels should appear in a
    pronunciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th
    tpt. (This is a specific application of the general principle of
    "Garbage In, Garbage Out".)

* Basic limitations in the Unidecode design
    Writing a real and clever transliteration algorithm for any single
    language usually requires a lot of time, and at least a passable
    knowledge of the language involved. But Unicode text can convey
    more languages than I could possibly learn (much less create a
    transliterator for) in the entire rest of my lifetime. So I put a
    cap on how intelligent Unidecode could be, by insisting that it
    support only context-insensitive transliteration. That means
    missing the finer details of any given writing system, while still
    hopefully being useful.

Unidecode, in other words, is quick and dirty. Sometimes the output is
not so dirty at all: Russian and Greek seem to work passably; and while
Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
system, setting up a mapping from it to Roman letters seems to work
pretty well. But sometimes the output is very dirty: Unidecode does
quite badly on Japanese and Thai.

If you want a smarter transliteration for a particular language than
Unidecode provides, then you should look for (or write) a
transliteration algorithm specific to that language, and apply it
instead of (or at least before) applying Unidecode.
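
As a minimal sketch of the "at least before" arrangement, the example
below runs a small language-specific pass and then hands whatever is
left to "unidecode(...)". The %german_map table and the
transliterate_german() helper are hypothetical names invented for this
illustration; they are not part of Text::Unidecode, and the German
vowel-plus-"e" convention is only one possible choice of pre-pass.

```perl
use strict;
use warnings;
use Text::Unidecode;

# Hypothetical language-specific table (an assumption for this sketch):
# German umlauts become vowel + "e", and eszett becomes "ss".
my %german_map = (
    "\x{E4}" => "ae", "\x{F6}" => "oe", "\x{FC}" => "ue",
    "\x{C4}" => "Ae", "\x{D6}" => "Oe", "\x{DC}" => "Ue",
    "\x{DF}" => "ss",
);

sub transliterate_german {
    my ($text) = @_;

    # Language-specific pass first ...
    $text =~ s/([\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}])/$german_map{$1}/g;

    # ... then Unidecode as the fallback for anything still non-ASCII.
    return unidecode($text);
}

my $out = transliterate_german("M\x{FC}ller, Stra\x{DF}e");
print "$out\n";    # prints: Mueller, Strasse
```

Every non-ASCII character in this input is handled by the pre-pass, so
"unidecode(...)" only sees US-ASCII here and passes it through
unchanged; with other input it would transliterate whatever the
pre-pass left alone.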

In other words, Unidecode's approach is broad (knowing about dozens of
writing systems), but shallow (not being meticulous about any of them).

FUNCTIONS
Text::Unidecode provides one function, "unidecode(...)", which is
exported by default. It can be used in a variety of calling contexts:

"$out = unidecode($in);" # scalar context
    This returns a copy of $in, transliterated.

"$out = unidecode(@in);" # scalar context
    This is the same as "$out = unidecode(join '', @in);"

"@out = unidecode(@in);" # list context
    This returns a list consisting of copies of @in, each
    transliterated. This is the same as
    "@out = map scalar(unidecode($_)), @in;"

"unidecode(@items);" # void context
"unidecode(@bar, $foo, @baz);" # void context
    Each item on input is replaced with its transliteration. This is
    the same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"
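
The calling contexts above can be exercised side by side. This is a
small sketch rather than anything from the distribution; the variable
names are arbitrary, and the input is the pair of Han characters from
the synopsis.

```perl
use strict;
use warnings;
use Text::Unidecode;

my @in = ("\x{5317}", "\x{4EB0}");    # the two characters from the synopsis

# Scalar context with a list argument: the inputs are joined first.
my $joined = unidecode(@in);

# List context: one transliterated copy per input; @in is untouched.
my @each = unidecode(@in);

# Void context: the arguments themselves are overwritten.
my @items = @in;
unidecode(@items);

print "scalar: [$joined]\n";
print "list:   [", join("][", @each), "]\n";
print "void:   [", join("][", @items), "]\n";
```

Only the void-context call modifies its arguments, so it is the one to
avoid when the original strings are still needed.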

You should make a minimum of assumptions about the output of
"unidecode(...)". For example, if you assume an all-alphabetic
(Unicode) string passed to "unidecode(...)" will return an
all-alphabetic string, you're wrong -- some alphabetic Unicode
characters are transliterated as strings containing punctuation (e.g.,
the Armenian letter at 0x0539 currently transliterates as "T`").

However, these are the assumptions you can make:

· Each character 0x0000 - 0x007F transliterates as itself. That is,
  "unidecode(...)" is 7-bit pure.

· The output of "unidecode(...)" always consists entirely of US-ASCII
  characters -- i.e., characters 0x0000 - 0x007F.

· All Unicode characters translate to a sequence of (any number of)
  characters that are newline ("\n") or in the range 0x0020-0x007E.
  That is, no Unicode character translates to "\x01", for example.
  (Although if you have a "\x01" on input, you'll get a "\x01" in
  output.)

· Yes, some transliterations produce a "\n" -- but just a few, and
  only with good reason. Note that the value of newline ("\n")
  varies from platform to platform -- see perlport.

· Some Unicode characters may transliterate to nothing (i.e., empty
  string).

· Very many Unicode characters transliterate to multi-character
  sequences. E.g., Han character 0x5317 transliterates as the
  four-character string "Bei ".

· Within these constraints, I may change the transliteration of
  characters in future versions. For example, if someone convinces me
  that the Armenian letter at 0x0539, currently transliterated as
  "T`", would be better transliterated as "D", I may well make that
  change.
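
Because the exact output may change between versions, code that
depends on "unidecode(...)" is probably safest when it checks only the
guarantees listed above. The following is a rough sketch of such
checks; the sample strings are arbitrary, and the is_us_ascii() helper
is a name made up for this example.

```perl
use strict;
use warnings;
use Text::Unidecode;

# Hypothetical helper: true if the string holds only 0x00-0x7F characters.
sub is_us_ascii {
    my ($string) = @_;
    return $string !~ /[^\x00-\x7F]/;
}

my $ascii = "plain ASCII, passed through as is\n";
my $mixed = "caf\x{E9} \x{5317}\x{4EB0} \x{0539}";

my $out = unidecode($mixed);

# Guarantee: characters 0x00-0x7F transliterate as themselves.
print "ASCII input is unchanged\n" if unidecode($ascii) eq $ascii;

# Guarantee: the output consists entirely of US-ASCII characters.
print "output is pure US-ASCII\n" if is_us_ascii($out);

# Not guaranteed: length. One input character may yield zero, one, or
# several output characters, so never assume the lengths match.
print length($mixed), " characters in, ", length($out), " characters out\n";
```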

DESIGN GOALS AND CONSTRAINTS
Text::Unidecode is meant to be a transliterator of last resort, to be
used once you've decided that you can't just display the Unicode data
as is, and once you've decided you don't have a more clever,
language-specific transliterator available. It transliterates
context-insensitively -- that is, a given character is replaced with
the same US-ASCII (7-bit ASCII) character or characters, no matter what
the surrounding characters are.
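
One consequence of context-insensitive replacement is that
transliterating a string as a whole should give the same result as
transliterating each character on its own and joining the pieces. The
sketch below checks that property on an arbitrary sample (the synopsis
characters followed by a Russian word); it illustrates the stated
design and is not a test shipped with the module.

```perl
use strict;
use warnings;
use Text::Unidecode;

# Arbitrary sample: two Han characters, a space, and a Cyrillic word.
my $text = "\x{5317}\x{4EB0} \x{041C}\x{043E}\x{0441}\x{043A}\x{0432}\x{0430}";

# Transliterate the whole string in one call ...
my $whole = unidecode($text);

# ... and again one character at a time, joining the results.
my $piecewise = join '', map scalar(unidecode($_)), split //, $text;

print $whole eq $piecewise
    ? "whole-string and per-character results agree\n"
    : "results differ\n";
```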

The main reason I'm making Text::Unidecode work with only
context-insensitive substitution is that it's fast, dumb, and
straightforward enough to be feasible. It doesn't tax my (quite
limited) knowledge of world languages. It doesn't require me to write
a hundred lines of code to get the Thai syllabification right (and
never know whether I've gotten it wrong, because I don't know Thai),
or to spend a year trying to get Text::Unidecode to use the ChaSen
algorithm for Japanese, or to write heuristics for telling the
difference between Japanese, Chinese, and Korean, so it knows how to
transliterate any given Uni-Han glyph. And moreover,
context-insensitive substitution is still mostly useful, but still
clearly couldn't be mistaken for authoritative.

Text::Unidecode is an example of the 80/20 rule in action -- you get
80% of the usefulness using just 20% of a "real" solution.

A "real" approach to transliteration for any given language can involve
such increasingly tricky contextual factors as these:

The preceding / following character(s)
    What a given symbol "X" means could depend on whether it's
    followed by a consonant, or by a vowel, or by some diacritic
    character.

Syllables
    A character "X" at the end of a syllable could mean something
    different from when it's at the start -- which is especially
    problematic when the language involved doesn't explicitly mark
    where one syllable stops and the next starts.

Parts of speech
    What "X" sounds like at the end of a word depends on whether that
    word is a noun, or a verb, or what.

Meaning
    By semantic context, you can tell that this ideogram "X" means
    "shoe" (pronounced one way) and not "time" (pronounced another),
    and that's how you know to transliterate it one way instead of the
    other.

Origin of the word
    "X" means one thing in loanwords and/or placenames (and derivatives
    thereof), and another in native words.

"It's just that way"
    "X" normally makes the /X/ sound, except for this list of seventy
    exceptions (and words based on them, sometimes indirectly). Or:
    you never can tell which of the three ways to pronounce "X" this
    word actually uses; you just have to know which it is, so keep a
    dictionary on hand!

Language
    The character "X" is actually used in several different languages,
    and you have to figure out which you're looking at before you can
    determine how to transliterate it.

Out of a desire to avoid being mired in any of these kinds of
contextual factors, I chose to exclude all of them and just stick with
context-insensitive replacement.

TODO
Things that need tending to are detailed in the TODO.txt file, included
in this distribution. Normal installs probably don't leave the
TODO.txt lying around, but if nothing else, you can see it at
http://search.cpan.org/search?dist=Text::Unidecode

MOTTO
The Text::Unidecode motto is:

    It's better than nothing!

...in both meanings: 1) seeing the output of "unidecode(...)" is better
than just having all font-unavailable Unicode characters replaced with
"?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
nothing that Text::Unidecode's algorithm is better than.

CAVEATS
If you get really implausible nonsense out of "unidecode(...)", make
sure that the input data really is a utf8 string. See perlunicode.
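
A common way to end up with input that is not really a utf8 string is
to read UTF-8 encoded bytes and pass them along undecoded. As a hedged
sketch (assuming the data is UTF-8, which this module has no way to
know), the bytes can be decoded with the core Encode module first. The
byte string below is simply the UTF-8 encoding of the two characters
from the synopsis.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Raw bytes, as they might arrive from a file or socket that was read
# without an encoding layer: the UTF-8 form of U+5317 U+4EB0.
my $bytes = "\xE5\x8C\x97\xE4\xBA\xB0";

# Decode the bytes into a Perl character string before transliterating.
my $chars = decode('UTF-8', $bytes);

print length($bytes), " bytes, ", length($chars), " characters\n";
print unidecode($chars), "\n";
```

Passing $bytes itself to "unidecode(...)" would treat each byte as a
separate character, which is one way to get the implausible output
described above.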

THANKS
Thanks to Harald Tveit Alvestrand, Abhijit Menon-Sen, and Mark-Jason
Dominus.

SEE ALSO
Unicode Consortium: http://www.unicode.org/

Geoffrey Sampson. 1990. Writing Systems: A Linguistic Introduction.
ISBN: 0804717567

Randall K. Barry (editor). 1997. ALA-LC Romanization Tables:
Transliteration Schemes for Non-Roman Scripts. ISBN: 0844409405 [ALA
is the American Library Association; LC is the Library of Congress.]

Rupert Snell. 2000. Beginner's Hindi Script (Teach Yourself Books).
ISBN: 0658009109

COPYRIGHT AND DISCLAIMERS
Copyright (c) 2001 Sean M. Burke. All rights reserved.

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

Much of Text::Unidecode's internal data is based on data from The
Unicode Consortium, with which I am unaffiliated.

AUTHOR
Sean M. Burke "sburke@cpan.org"

perl v5.8.8    2001-07-14    Text::Unidecode(3)