Text::Unidecode(3)    User Contributed Perl Documentation   Text::Unidecode(3)

NAME

       Text::Unidecode -- US-ASCII transliterations of Unicode text

SYNOPSIS

         use utf8;
         use Text::Unidecode;
         print unidecode(
           "\x{5317}\x{4EB0}\n"
            # those are the Chinese characters for Beijing
         );

         # That prints: Bei Jing

DESCRIPTION

       It often happens that you have non-Roman text data in Unicode, but you
       can't display it -- usually because you're trying to show it to a user
       via an application that doesn't support Unicode, or because the fonts
       you need aren't accessible.  You could represent the Unicode characters
       as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
       user who actually wants to read what the text says.

       What Text::Unidecode provides is a function, "unidecode(...)", that
       takes Unicode data and tries to represent it in US-ASCII characters
       (i.e., the universally displayable characters between 0x00 and 0x7F).
       The representation is almost always an attempt at transliteration --
       i.e., conveying, in Roman letters, the pronunciation expressed by the
       text in some other writing system.  (See the example in the synopsis.)

       Unidecode's ability to transliterate is limited by two factors:

       ·   The amount and quality of data in the original

           So if you have Hebrew data that has no vowel points in it, then
           Unidecode cannot guess what vowels should appear in a
           pronunciation.  S f y hv n vwls n th npt, y wn't gt ny vwls n th
           tpt.  (This is a specific application of the general principle of
           "Garbage In, Garbage Out".)

       ·   Basic limitations in the Unidecode design

           Writing a real and clever transliteration algorithm for any single
           language usually requires a lot of time, and at least a passable
           knowledge of the language involved.  But Unicode text can convey
           more languages than I could possibly learn (much less create a
           transliterator for) in the entire rest of my lifetime.  So I put a
           cap on how intelligent Unidecode could be, by insisting that it
           support only context-insensitive transliteration.  That means
           missing the finer details of any given writing system, while still
           hopefully being useful.

       Unidecode, in other words, is quick and dirty.  Sometimes the output is
       not so dirty at all: Russian and Greek seem to work passably; and while
       Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
       system, setting up a mapping from it to Roman letters seems to work
       pretty well.  But sometimes the output is very dirty: Unidecode does
       quite badly on Japanese and Thai.

       If you want a smarter transliteration for a particular language than
       Unidecode provides, then you should look for (or write) a
       transliteration algorithm specific to that language, and apply it
       instead of (or at least before) applying Unidecode.

       In other words, Unidecode's approach is broad (knowing about dozens of
       writing systems), but shallow (not being meticulous about any of them).

FUNCTIONS

       Text::Unidecode provides one function, "unidecode(...)", which is
       exported by default.  It can be used in a variety of calling contexts:

       "$out = unidecode($in);" # scalar context
           This returns a copy of $in, transliterated.

       "$out = unidecode(@in);" # scalar context
           This is the same as "$out = unidecode(join '', @in);"

       "@out = unidecode(@in);" # list context
           This returns a list consisting of copies of @in, each
           transliterated.  This is the same as "@out = map
           scalar(unidecode($_)), @in;"

       "unidecode(@items);" # void context
       "unidecode(@bar, $foo, @baz);" # void context
           Each item on input is replaced with its transliteration.  This is
           the same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"

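       The calling contexts above can be compared side by side.  This is a
       small sketch, assuming Text::Unidecode is installed; the variable
       names are illustrative only.

```perl
use strict;
use warnings;
use Text::Unidecode;

my @in = ("\x{5317}", "\x{4EB0}");

my $joined = unidecode(@in);   # scalar context: one string for the whole list
my @each   = unidecode(@in);   # list context: one output string per input

my @copy = @in;                # keep @in intact for comparison
unidecode(@copy);              # void context: @copy is modified in place

print "scalar: [$joined]\n";
print "list:   [$_]\n" for @each;
print "void:   [$_]\n" for @copy;
```

       Only the void-context call changes its arguments; the scalar and
       list forms work on copies and leave @in untouched.
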
       You should make a minimum of assumptions about the output of
       "unidecode(...)".  For example, if you assume an all-alphabetic
       (Unicode) string passed to "unidecode(...)" will return an
       all-alphabetic string, you're wrong -- some alphabetic Unicode
       characters are transliterated as strings containing punctuation
       (e.g., the Armenian letter at 0x0539 currently transliterates as
       "T`").

       However, these are the assumptions you can make:

       ·   Each character 0x0000 - 0x007F transliterates as itself.  That is,
           "unidecode(...)" is 7-bit pure.

       ·   The output of "unidecode(...)" always consists entirely of US-ASCII
           characters -- i.e., characters 0x0000 - 0x007F.

       ·   All Unicode characters transliterate to a sequence of (any number
           of) characters that are newline ("\n") or in the range
           0x0020-0x007E.  That is, no Unicode character transliterates to
           "\x01", for example.  (Although if you have a "\x01" on input,
           you'll get a "\x01" in output.)

       ·   Yes, some transliterations produce a "\n" -- but just a few, and
           only with good reason.  Note that the value of newline ("\n")
           varies from platform to platform -- see perlport.

       ·   Some Unicode characters may transliterate to nothing (i.e., the
           empty string).

       ·   Very many Unicode characters transliterate to multi-character
           sequences.  E.g., Han character 0x5317 transliterates as the
           four-character string "Bei ".

       ·   Within these constraints, I may change the transliteration of
           characters in future versions.  For example, if someone convinces
           me that the Armenian letter at 0x0539, currently transliterated as
           "T`", would be better transliterated as "D", I may well make that
           change.

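       One way to code against only the guarantees listed above is to
       post-process the output rather than trust its shape.  The sketch
       below builds a URL-style slug; ascii_slug() is a hypothetical helper
       introduced here, not part of Text::Unidecode.

```perl
use strict;
use warnings;
use Text::Unidecode;

# ascii_slug() is a hypothetical helper, not part of Text::Unidecode.
sub ascii_slug {
    my ($text) = @_;
    my $ascii = unidecode($text);

    # Documented guarantee: only US-ASCII characters come back.
    die "unexpected non-ASCII output" if $ascii =~ /[^\x00-\x7F]/;

    # Not guaranteed: letters only.  Collapse every other run to a hyphen,
    # then trim hyphens left at either end.
    $ascii =~ s/[^A-Za-z0-9]+/-/g;
    $ascii =~ s/\A-+|-+\z//g;
    return lc $ascii;
}

print ascii_slug("Hello, World!"), "\n";
```

       The cleanup step is what absorbs punctuation, spaces, newlines, or
       empty results, so the slug stays well formed even if individual
       transliterations change in a later version.
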

DESIGN GOALS AND CONSTRAINTS

       Text::Unidecode is meant to be a transliterator of last resort, to be
       used once you've decided that you can't just display the Unicode data
       as is, and once you've decided you don't have a more clever,
       language-specific transliterator available.  It transliterates
       context-insensitively -- that is, a given character is replaced with
       the same US-ASCII (7-bit ASCII) character or characters, no matter
       what the surrounding characters are.

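       To make "context-insensitive" concrete, here is a toy sketch of the
       idea: every non-ASCII character is looked up on its own in a fixed
       table.  The three-letter %toy_table, the "?" placeholder, and
       toy_transliterate() are assumptions made for illustration; this is
       not a description of how Text::Unidecode stores its data.

```perl
use strict;
use warnings;

# Toy table covering three Greek letters; purely illustrative.
my %toy_table = (
    "\x{3B1}" => 'a',    # GREEK SMALL LETTER ALPHA
    "\x{3B2}" => 'b',    # GREEK SMALL LETTER BETA
    "\x{3B3}" => 'g',    # GREEK SMALL LETTER GAMMA
);

sub toy_transliterate {
    my ($text) = @_;
    # Each non-ASCII character is looked up on its own; neighbors are never
    # consulted, so the same character always yields the same output.
    $text =~ s/([^\x00-\x7F])/exists $toy_table{$1} ? $toy_table{$1} : '?'/ge;
    return $text;
}

print toy_transliterate("\x{3B1}\x{3B2}\x{3B3}"), "\n";
```

       Anything that depends on neighboring characters, syllables, or word
       meaning cannot be expressed in a table of this shape.
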
       The main reason I'm making Text::Unidecode work with only
       context-insensitive substitution is that it's fast, dumb, and
       straightforward enough to be feasible.  It doesn't tax my (quite
       limited) knowledge of world languages.  It doesn't require me writing
       a hundred lines of code to get the Thai syllabification right (and
       never knowing whether I've gotten it wrong, because I don't know
       Thai), or spending a year trying to get Text::Unidecode to use the
       ChaSen algorithm for Japanese, or trying to write heuristics for
       telling the difference between Japanese, Chinese, and Korean, so it
       knows how to transliterate any given Uni-Han glyph.  And moreover,
       context-insensitive substitution is still mostly useful, but still
       clearly couldn't be mistaken for authoritative.

       Text::Unidecode is an example of the 80/20 rule in action -- you get
       80% of the usefulness using just 20% of a "real" solution.

       A "real" approach to transliteration for any given language can involve
       such increasingly tricky contextual factors as these:

       The previous / following character(s)
           What a given symbol "X" means could depend on whether it's
           preceded or followed by a consonant, or by a vowel, or by some
           diacritic character.

       Syllables
           A character "X" at the end of a syllable could mean something
           different from when it's at the start -- which is especially
           problematic when the language involved doesn't explicitly mark
           where one syllable stops and the next starts.

       Parts of speech
           What "X" sounds like at the end of a word depends on whether that
           word is a noun, or a verb, or what.

       Meaning
           By semantic context, you can tell that this ideogram "X" means
           "shoe" (pronounced one way) and not "time" (pronounced another),
           and that's how you know to transliterate it one way instead of the
           other.

       Origin of the word
           "X" means one thing in loanwords and/or placenames (and derivatives
           thereof), and another in native words.

       "It's just that way"
           "X" normally makes the /X/ sound, except for this list of seventy
           exceptions (and words based on them, sometimes indirectly).  Or:
           you never can tell which of the three ways to pronounce "X" this
           word actually uses; you just have to know which it is, so keep a
           dictionary on hand!

       Language
           The character "X" is actually used in several different languages,
           and you have to figure out which you're looking at before you can
           determine how to transliterate it.

       Out of a desire to avoid being mired in any of these kinds of
       contextual factors, I chose to exclude all of them and just stick with
       context-insensitive replacement.

TODO

       Things that need tending to are detailed in the TODO.txt file, included
       in this distribution.  Normal installs probably don't leave the
       TODO.txt lying around, but if nothing else, you can see it at
       http://search.cpan.org/search?dist=Text::Unidecode

MOTTO

       The Text::Unidecode motto is:

         It's better than nothing!

       ...in both meanings: 1) seeing the output of "unidecode(...)" is better
       than just having all font-unavailable Unicode characters replaced with
       "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
       nothing that Text::Unidecode's algorithm is better than.

CAVEATS

       If you get really implausible nonsense out of "unidecode(...)", make
       sure that the input data really is a utf8 string.  See perlunicode.

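       A common way to end up with nonsense is to pass undecoded bytes.
       The sketch below assumes the input arrived as raw UTF-8 bytes (for
       example from a file or socket) and decodes it with the core Encode
       module before transliterating; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Raw UTF-8 bytes for U+5317, as they might arrive from a file or socket.
my $bytes = "\xE5\x8C\x97";

# Decode the bytes into Perl characters before transliterating.
my $chars = decode('UTF-8', $bytes);
my $ascii = unidecode($chars);

print "[$ascii]\n";
```

       Without the decode step, each of the three bytes would be treated as
       a separate character and transliterated on its own.
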

THANKS

       Thanks to Harald Tveit Alvestrand, Abhijit Menon-Sen, and Mark-Jason
       Dominus.

SEE ALSO

       Unicode Consortium: http://www.unicode.org/

       Geoffrey Sampson.  1990.  Writing Systems: A Linguistic Introduction.
       ISBN: 0804717567

       Randall K. Barry (editor).  1997.  ALA-LC Romanization Tables:
       Transliteration Schemes for Non-Roman Scripts.  ISBN: 0844409405 [ALA
       is the American Library Association; LC is the Library of Congress.]

       Rupert Snell.  2000.  Beginner's Hindi Script (Teach Yourself Books).
       ISBN: 0658009109

COPYRIGHT AND DISCLAIMERS

       Copyright (c) 2001 Sean M. Burke. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

       Much of Text::Unidecode's internal data is based on data from The
       Unicode Consortium, with which I am unaffiliated.

AUTHOR

       Sean M. Burke "sburke@cpan.org"



perl v5.10.1                      2001-07-14                Text::Unidecode(3)