Text::Unidecode(3pm)

Text::Unidecode(3)    User Contributed Perl Documentation   Text::Unidecode(3)

NAME

       Text::Unidecode -- US-ASCII transliterations of Unicode text

SYNOPSIS

         use utf8;
         use Text::Unidecode;
         print unidecode(
           "\x{5317}\x{4EB0}\n"
            # those are the Chinese characters for Beijing
         );

         # That prints: Bei Jing

DESCRIPTION

       It often happens that you have non-Roman text data in Unicode, but you
       can't display it -- usually because you're trying to show it to a user
       via an application that doesn't support Unicode, or because the fonts
       you need aren't accessible.  You could represent the Unicode characters
       as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
       user who actually wants to read what the text says.

       What Text::Unidecode provides is a function, "unidecode(...)", that
       takes Unicode data and tries to represent it in US-ASCII characters
       (i.e., the universally displayable characters between 0x00 and 0x7F).
       The representation is almost always an attempt at transliteration --
       i.e., conveying, in Roman letters, the pronunciation expressed by the
       text in some other writing system.  (See the example in the synopsis.)

       Unidecode's ability to transliterate is limited by two factors:

       * The amount and quality of data in the original
           So if you have Hebrew data that has no vowel points in it, then
           Unidecode cannot guess what vowels should appear in a
           pronunciation.  S f y hv n vwls n th npt, y wn't gt ny vwls n th
           tpt.  (This is a specific application of the general principle of
           "Garbage In, Garbage Out".)

       * Basic limitations in the Unidecode design
           Writing a real and clever transliteration algorithm for any single
           language usually requires a lot of time, and at least a passable
           knowledge of the language involved.  But Unicode text can convey
           more languages than I could possibly learn (much less create a
           transliterator for) in the entire rest of my lifetime.  So I put a
           cap on how intelligent Unidecode could be, by insisting that it
           support only context-insensitive transliteration.  That means
           missing the finer details of any given writing system, while still
           hopefully being useful.

       Unidecode, in other words, is quick and dirty.  Sometimes the output is
       not so dirty at all: Russian and Greek seem to work passably; and while
       Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
       system, setting up a mapping from it to Roman letters seems to work
       pretty well.  But sometimes the output is very dirty: Unidecode does
       quite badly on Japanese and Thai.

       If you want a smarter transliteration for a particular language than
       Unidecode provides, then you should look for (or write) a
       transliteration algorithm specific to that language, and apply it
       instead of (or at least before) applying Unidecode.

       In other words, Unidecode's approach is broad (knowing about dozens of
       writing systems), but shallow (not being meticulous about any of them).
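
       The advice about applying a language-specific step before Unidecode
       might look like the following minimal sketch.  The German umlaut table
       and the helper name "german_then_unidecode" are assumptions made up for
       illustration; they are not part of Text::Unidecode.

```perl
use strict;
use warnings;
use Text::Unidecode;

# Hypothetical language-specific table (German); not part of Text::Unidecode.
my %GERMAN = (
    "\x{E4}" => 'ae', "\x{F6}" => 'oe', "\x{FC}" => 'ue',   # a, o, u with umlaut
    "\x{C4}" => 'Ae', "\x{D6}" => 'Oe', "\x{DC}" => 'Ue',   # capital forms
    "\x{DF}" => 'ss',                                       # sharp s
);

sub german_then_unidecode {
    my ($text) = @_;
    # First pass: apply the language-specific rules.
    $text =~ s/([\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}])/$GERMAN{$1}/g;
    # Second pass: Unidecode handles whatever non-ASCII characters remain.
    return scalar unidecode($text);
}

print german_then_unidecode("M\x{FC}ller"), "\n";   # prints: Mueller
```

       Because the first pass leaves only US-ASCII characters in this input,
       the Unidecode pass returns it unchanged.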

FUNCTIONS

       Text::Unidecode provides one function, "unidecode(...)", which is
       exported by default.  It can be used in a variety of calling contexts:

       "$out = unidecode($in);" # scalar context
           This returns a copy of $in, transliterated.

       "$out = unidecode(@in);" # scalar context
           This is the same as "$out = unidecode(join '', @in);"

       "@out = unidecode(@in);" # list context
           This returns a list consisting of copies of @in, each
           transliterated.  This is the same as
           "@out = map scalar(unidecode($_)), @in;"

       "unidecode(@items);" # void context
       "unidecode(@bar, $foo, @baz);" # void context
           Each item on input is replaced with its transliteration.  This is
           the same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"

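       The calling contexts listed above can be tried side by side.  This is
       a small sketch; the variable names are arbitrary, and the expected
       values in the comments rely on the documented transliteration of
       0x5317 as "Bei ".

```perl
use strict;
use warnings;
use Text::Unidecode;

my $han = "\x{5317}";                  # documented as transliterating to "Bei "

# Scalar context, one argument: a transliterated copy; $han is untouched.
my $copy = unidecode($han);            # "Bei "

# Scalar context, several arguments: the results are joined into one string.
my $joined = unidecode($han, "abc");   # "Bei abc"

# List context: one transliterated copy per input item.
my @copies = unidecode($han, "abc");   # ("Bei ", "abc")

# Void context: the arguments themselves are modified in place.
my @items = ($han, "abc");
unidecode(@items);                     # @items is now ("Bei ", "abc")

print "[$_]\n" for $copy, $joined, @copies, @items;
```
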
       You should make a minimum of assumptions about the output of
       "unidecode(...)".  For example, if you assume an all-alphabetic
       (Unicode) string passed to "unidecode(...)" will return an
       all-alphabetic string, you're wrong -- some alphabetic Unicode
       characters are transliterated as strings containing punctuation (e.g.,
       the Armenian letter at 0x0539 currently transliterates as "T`").

       However, these are the assumptions you can make:

       ·   Each character 0x0000 - 0x007F transliterates as itself.  That is,
           "unidecode(...)" is 7-bit pure.

       ·   The output of "unidecode(...)" always consists entirely of US-ASCII
           characters -- i.e., characters 0x0000 - 0x007F.

       ·   All Unicode characters translate to a sequence of (any number of)
           characters that are newline ("\n") or in the range 0x0020-0x007E.
           That is, no Unicode character translates to "\x01", for example.
           (Although if you have a "\x01" on input, you'll get a "\x01" in
           output.)

       ·   Yes, some transliterations produce a "\n" -- but just a few, and
           only with good reason.  Note that the value of newline ("\n")
           varies from platform to platform -- see perlport.

       ·   Some Unicode characters may transliterate to nothing (i.e., empty
           string).

       ·   Very many Unicode characters transliterate to multi-character
           sequences.  E.g., Han character 0x5317 transliterates as the
           four-character string "Bei ".

       ·   Within these constraints, I may change the transliteration of
           characters in future versions.  For example, if someone convinces
           me that the Armenian letter at 0x0539, currently transliterated as
           "T`", would be better transliterated as "D", I may well make that
           change.
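
       Given the guarantees above (US-ASCII only, but possibly punctuation,
       spaces, newlines, or nothing at all), code that needs a narrower
       alphabet has to filter the output itself.  The following is one
       possible sketch; the helper name "ascii_slug" and the fallback value
       "untitled" are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Text::Unidecode;

# Hypothetical helper: builds a lowercase identifier from arbitrary text.
sub ascii_slug {
    my ($text) = @_;
    my $ascii = lc unidecode($text);   # US-ASCII, but not necessarily alphabetic
    $ascii =~ s/[^a-z0-9]+/-/g;        # collapse punctuation, spaces and newlines
    $ascii =~ s/^-+|-+$//g;            # trim separators left at either end
    # The input may have transliterated to nothing, so supply a fallback.
    return length($ascii) ? $ascii : 'untitled';
}

# The synopsis string; "bei-jing" is expected, given the documented "Bei Jing".
print ascii_slug("\x{5317}\x{4EB0}"), "\n";
```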

DESIGN GOALS AND CONSTRAINTS

       Text::Unidecode is meant to be a transliterator of last resort, to be
       used once you've decided that you can't just display the Unicode data
       as is, and once you've decided you don't have a more clever,
       language-specific transliterator available.  It transliterates
       context-insensitively -- that is, a given character is replaced with
       the same US-ASCII (7-bit ASCII) character or characters, no matter what
       the surrounding characters are.

       The main reason I'm making Text::Unidecode work with only
       context-insensitive substitution is that it's fast, dumb, and
       straightforward enough to be feasible.  It doesn't tax my (quite
       limited) knowledge of world languages.  It doesn't require me to write
       a hundred lines of code to get the Thai syllabification right (and
       never know whether I've gotten it wrong, because I don't know Thai),
       or to spend a year trying to get Text::Unidecode to use the ChaSen
       algorithm for Japanese, or to write heuristics for telling the
       difference between Japanese, Chinese, and Korean, so it knows how to
       transliterate any given Uni-Han glyph.  And moreover,
       context-insensitive substitution is still mostly useful, but still
       clearly couldn't be mistaken for authoritative.

       Text::Unidecode is an example of the 80/20 rule in action -- you get
       80% of the usefulness using just 20% of a "real" solution.

       A "real" approach to transliteration for any given language can involve
       such increasingly tricky contextual factors as these:

       The preceding / following character(s)
           What a given symbol "X" means could depend on whether it's
           followed by a consonant, or by a vowel, or by some diacritic
           character.

       Syllables
           A character "X" at end of a syllable could mean something different
           from when it's at the start -- which is especially problematic when
           the language involved doesn't explicitly mark where one syllable
           stops and the next starts.

       Parts of speech
           What "X" sounds like at the end of a word depends on whether that
           word is a noun, or a verb, or what.

       Meaning
           By semantic context, you can tell that this ideogram "X" means
           "shoe" (pronounced one way) and not "time" (pronounced another),
           and that's how you know to transliterate it one way instead of the
           other.

       Origin of the word
           "X" means one thing in loanwords and/or placenames (and derivatives
           thereof), and another in native words.

       "It's just that way"
           "X" normally makes the /X/ sound, except for this list of seventy
           exceptions (and words based on them, sometimes indirectly).  Or:
           you never can tell which of the three ways to pronounce "X" this
           word actually uses; you just have to know which it is, so keep a
           dictionary on hand!

       Language
           The character "X" is actually used in several different languages,
           and you have to figure out which you're looking at before you can
           determine how to transliterate it.

       Out of a desire to avoid being mired in any of these kinds of
       contextual factors, I chose to exclude all of them and just stick with
       context-insensitive replacement.
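
       Context-insensitive replacement, as described in this section, amounts
       to a per-character table lookup.  The toy sketch below illustrates the
       idea only; it is not the module's actual implementation.  The
       two-entry table, the "?" placeholder for unknown characters, and the
       name "toy_transliterate" are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-entry table, using values quoted in this document.
# The real module covers vastly more characters.
my %TABLE = (
    "\x{5317}" => "Bei ",
    "\x{0539}" => "T`",
);

sub toy_transliterate {
    my ($text) = @_;
    # Each non-ASCII character is looked up on its own; its neighbours are
    # never consulted, so the same character always yields the same output.
    $text =~ s/([^\x00-\x7F])/exists $TABLE{$1} ? $TABLE{$1} : '?'/ge;
    return $text;
}

print toy_transliterate("a\x{5317}b"), "\n";   # prints: aBei b
```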

TODO

       Things that need tending to are detailed in the TODO.txt file, included
       in this distribution.  Normal installs probably don't leave the
       TODO.txt lying around, but if nothing else, you can see it at
       http://search.cpan.org/search?dist=Text::Unidecode

MOTTO

       The Text::Unidecode motto is:

         It's better than nothing!

       ...in both meanings: 1) seeing the output of "unidecode(...)" is better
       than just having all font-unavailable Unicode characters replaced with
       "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
       nothing that Text::Unidecode's algorithm is better than.

CAVEATS

       If you get really implausible nonsense out of "unidecode(...)", make
       sure that the input data really is a utf8 string (i.e., a string of
       decoded Unicode characters, not raw encoded bytes).  See perlunicode.
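
       One common way to meet this requirement is to decode incoming bytes
       with the core Encode module before calling "unidecode(...)".  This is
       a sketch that assumes the input arrives as UTF-8 encoded bytes; other
       encodings would need a different first argument to "decode".

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Raw bytes, as they might arrive from a file or socket (assumed input).
my $bytes = "\xE5\x8C\x97";            # the UTF-8 encoding of U+5317

# Decode to a Perl character string before transliterating.
my $chars = decode('UTF-8', $bytes);

print unidecode($chars), "\n";         # "Bei ", per the FUNCTIONS section
```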

THANKS

       Thanks to Harald Tveit Alvestrand, Abhijit Menon-Sen, and Mark-Jason
       Dominus.

COPYRIGHT AND DISCLAIMERS

       Copyright (c) 2001 Sean M. Burke. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

       Much of Text::Unidecode's internal data is based on data from The
       Unicode Consortium, with which I am unaffiliated.

AUTHOR

       Sean M. Burke "sburke@cpan.org"

perl v5.8.8                       2001-07-14                Text::Unidecode(3)