Text::Unidecode(3)    User Contributed Perl Documentation    Text::Unidecode(3)

NAME
Text::Unidecode -- plain ASCII transliterations of Unicode text

SYNOPSIS
    use utf8;
    use Text::Unidecode;
    print unidecode(
      "北亰\n"
      # Chinese characters for Beijing (U+5317 U+4EB0)
    );

    # That prints: Bei Jing

DESCRIPTION
It often happens that you have non-Roman text data in Unicode, but you
can't display it-- usually because you're trying to show it to a user
via an application that doesn't support Unicode, or because the fonts
you need aren't accessible. You could represent the Unicode characters
as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
user who actually wants to read what the text says.

What Text::Unidecode provides is a function, "unidecode(...)", that
takes Unicode data and tries to represent it in US-ASCII characters
(i.e., the universally displayable characters between 0x00 and 0x7F).
The representation is almost always an attempt at transliteration--
i.e., conveying, in Roman letters, the pronunciation expressed by the
text in some other writing system. (See the example in the synopsis.)

NOTE:

To make sure your perldoc/Pod viewing setup for viewing this page is
working: The six-letter word "résumé" should look like "resume" with a
"/" accent on each "e".

For further tests, and help if that doesn't work, see below, "A POD
ENCODING TEST".

DESIGN PHILOSOPHY
Unidecode's ability to transliterate from a given language is limited
by two factors:

· The amount and quality of data in the written form of the original
  language

  So if you have Hebrew data that has no vowel points in it, then
  Unidecode cannot guess what vowels should appear in a
  pronunciation. S f y hv n vwls n th npt, y wn't gt ny vwls n th
  tpt. (This is a specific application of the general principle of
  "Garbage In, Garbage Out".)

· Basic limitations in the Unidecode design

  Writing a real and clever transliteration algorithm for any single
  language usually requires a lot of time, and at least a passable
  knowledge of the language involved. But Unicode text can convey
  more languages than I could possibly learn (much less create a
  transliterator for) in the entire rest of my lifetime. So I put a
  cap on how intelligent Unidecode could be, by insisting that it
  support only context-insensitive transliteration. That means
  missing the finer details of any given writing system, while still
  hopefully being useful.

Unidecode, in other words, is quick and dirty. Sometimes the output is
not so dirty at all: Russian and Greek seem to work passably; and while
Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
system, setting up a mapping from it to Roman letters seems to work
pretty well. But sometimes the output is very dirty: Unidecode does
quite badly on Japanese and Thai.

If you want a smarter transliteration for a particular language than
Unidecode provides, then you should look for (or write) a
transliteration algorithm specific to that language, and apply it
instead of (or at least before) applying Unidecode.

In other words, Unidecode's approach is broad (knowing about dozens of
writing systems), but shallow (not being meticulous about any of them).

FUNCTIONS
Text::Unidecode provides one function, "unidecode(...)", which is
exported by default. It can be used in a variety of calling contexts:

"$out = unidecode( $in );" # scalar context
    This returns a copy of $in, transliterated.

"$out = unidecode( @in );" # scalar context
    This is the same as "$out = unidecode(join "", @in);"

"@out = unidecode( @in );" # list context
    This returns a list consisting of copies of @in, each
    transliterated. This is the same as "@out = map
    scalar(unidecode($_)), @in;"

"unidecode( @items );" # void context
"unidecode( @bar, $foo, @baz );" # void context
    Each item on input is replaced with its transliteration. This is
    the same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"
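
The calling contexts above can be tried side by side. This is only a
minimal sketch: the variable names (@words, $joined, @copies) are made
up for illustration, and it assumes the current tables still turn
"ö" into "o" and "é" into "e".

```perl
use strict;
use warnings;
use utf8;                  # this source contains literal non-ASCII text
use Text::Unidecode;

my @words = ("Schön", "Résumé");

# Scalar context with a list: same as transliterating the joined items.
my $joined = unidecode(@words);

# List context: one transliterated copy per input item.
my @copies = unidecode(@words);

# Neither call above has touched the originals.
my $still_original = $words[0] eq "Schön";

# Void context: the arguments themselves are overwritten in place.
unidecode(@words);

print "scalar: $joined\n";
print "list:   @copies\n";
print "void:   @words\n";
```

With the current tables this should print "SchonResume", then
"Schon Resume" twice. The void-context form is the one to be careful
with, since it changes your data where it stands.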

You should make a minimum of assumptions about the output of
"unidecode(...)". For example, if you assume an all-alphabetic
(Unicode) string passed to "unidecode(...)" will return an
all-alphabetic string, you're wrong-- some alphabetic Unicode characters
are transliterated as strings containing punctuation (e.g., the
Armenian letter "Թ" (U+0539) currently transliterates as "T`":
capital-T, then a backtick).

However, these are the assumptions you can make:

· Each character 0x0000 - 0x007F transliterates as itself. That is,
  "unidecode(...)" is 7-bit pure.

· The output of "unidecode(...)" always consists entirely of US-ASCII
  characters-- i.e., characters 0x0000 - 0x007F.

· All Unicode characters translate to a sequence of (any number of)
  characters that are newline ("\n") or in the range 0x0020-0x007E.
  That is, no Unicode character translates to "\x01", for example.
  (Although if you have a "\x01" on input, you'll get a "\x01" in
  output.)

· Yes, some transliterations produce a "\n" but it's just a few, and
  only with good reason. Note that the value of newline ("\n")
  varies from platform to platform-- see perlport.

· Some Unicode characters may transliterate to nothing (i.e., empty
  string).

· Very many Unicode characters transliterate to multi-character
  sequences. E.g., Unihan character U+5317, "北", transliterates as
  the four-character string "Bei ".

· Within these constraints, I may change the transliteration of
  characters in future versions. For example, if someone convinces
  me that the Armenian letter "Թ", currently transliterated as
  "T`", would be better transliterated as "D", I may well make that
  change.

· Unfortunately, there are many characters that Unidecode doesn't
  know a transliteration for. This is generally because the
  character has been added since I last revised the Unidecode data
  tables. I'm always catching up!
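
A few of the guarantees above can be checked directly. The sketch
below is illustrative, not a test suite: the variable names are
invented, and the remark about "Թ" giving "T`" holds only as long as
the tables keep their present value.

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;

# Every character 0x00-0x7F should come back unchanged.
my $ascii = join '', map { chr } 0x00 .. 0x7F;
my $ascii_is_stable = unidecode($ascii) eq $ascii;

# Mixed input: a Unihan character, an Armenian letter, a currency sign.
my $out = unidecode("北 Թ ¥");

# No character outside 0x00-0x7F should survive.
my $all_ascii = $out !~ /[^\x00-\x7F]/;

# Alphabetic input need not give alphabetic output ("Թ" is "T`" today).
my $has_punctuation = $out =~ /[^A-Za-z\s]/;

printf "ASCII unchanged:  %s\n", $ascii_is_stable ? "yes" : "no";
printf "output all ASCII: %s\n", $all_ascii       ? "yes" : "no";
printf "has punctuation:  %s\n", $has_punctuation ? "yes" : "no";
print  "output: $out\n";
```

If you want to validate the output further (say, for use in a
filename), do that validation yourself after calling unidecode(...),
rather than assuming the result is alphanumeric.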

DESIGN GOALS AND CONSTRAINTS
Text::Unidecode is meant to be a transliterator of last resort, to be
used once you've decided that you can't just display the Unicode data
as is, and once you've decided you don't have a more clever,
language-specific transliterator available, or once you've already applied
smarter algorithms or mappings that you prefer and you now just want
Unidecode to do cleanup.

Unidecode transliterates context-insensitively-- that is, a given
character is replaced with the same US-ASCII (7-bit ASCII) character or
characters, no matter what the surrounding characters are.
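
One way to see what context-insensitive means in practice:
transliterating a string as a whole should give the same result as
transliterating each character separately and gluing the pieces
together. This is a sketch under that reading of the paragraph above;
$whole and $piecewise are names invented for the example.

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;

my $text = "КАК ВАС ЗОВУТ";

# The whole string in one call...
my $whole = unidecode($text);

# ...versus one call per character, concatenated afterwards.
my $piecewise = join '', map { scalar(unidecode($_)) } split //, $text;

print $whole eq $piecewise ? "same result\n" : "different result\n";
```

A context-sensitive transliterator could not promise this, because
the value of a character there depends on its neighbors.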

The main reason I'm making Text::Unidecode work with only
context-insensitive substitution is that it's fast, dumb, and straightforward
enough to be feasible. It doesn't tax my (quite limited) knowledge of
world languages. It doesn't require me writing a hundred lines of code
to get the Thai syllabification right (and never knowing whether I've
gotten it wrong, because I don't know Thai), or spending a year trying
to get Text::Unidecode to use the ChaSen algorithm for Japanese, or
trying to write heuristics for telling the difference between Japanese,
Chinese, or Korean, so it knows how to transliterate any given Uni-Han
glyph. And moreover, context-insensitive substitution is still mostly
useful, but still clearly couldn't be mistaken for authoritative.

Text::Unidecode is an example of the 80/20 rule in action-- you get 80%
of the usefulness using just 20% of a "real" solution.

A "real" approach to transliteration for any given language can involve
such increasingly tricky contextual factors as these:

The preceding / following character(s)
    What a given symbol "X" means could depend on whether it's
    followed by a consonant, or by a vowel, or by some diacritic
    character.

Syllables
    A character "X" at end of a syllable could mean something different
    from when it's at the start-- which is especially problematic when
    the language involved doesn't explicitly mark where one syllable
    stops and the next starts.

Parts of speech
    What "X" sounds like at the end of a word depends on whether that
    word is a noun, or a verb, or what.

Meaning
    By semantic context, you can tell that this ideogram "X" means
    "shoe" (pronounced one way) and not "time" (pronounced another),
    and that's how you know to transliterate it one way instead of the
    other.

Origin of the word
    "X" means one thing in loanwords and/or placenames (and derivatives
    thereof), and another in native words.

"It's just that way"
    "X" normally makes the /X/ sound, except for this list of seventy
    exceptions (and words based on them, sometimes indirectly). Or:
    you never can tell which of the three ways to pronounce "X" this
    word actually uses; you just have to know which it is, so keep a
    dictionary on hand!

Language
    The character "X" is actually used in several different languages,
    and you have to figure out which you're looking at before you can
    determine how to transliterate it.

Out of a desire to avoid being mired in any of these kinds of
contextual factors, I chose to exclude all of them and just stick with
context-insensitive replacement.

A POD ENCODING TEST
· "Brontë" is six characters that should look like "Bronte", but with
  double-dots on the "e" character.

· "Résumé" is six characters that should look like "Resume", but with
  /-shaped accents on the "e" characters.

· "læti" should be four letters long-- the second letter should not
  be two letters "ae", but should be a single letter that looks like
  an "a" entirely fused with an "e".

· "χρονος" is six Greek characters that should look kind of like:
  xpovoc

· "КАК ВАС ЗОВУТ" is three short Russian words that should look a lot
  like: KAK BAC 3OBYT

· "ടധ" is two Malayalam characters that should look like: sw

· "丫二十一" is four Chinese characters that should look like: "Y=+-"

· "Hello" is five characters that should look like: Hello

If all of those come out right, your Pod viewing setup is working
fine-- welcome to the 2010s! If those are full of garbage characters,
consider viewing this page as HTML at
<https://metacpan.org/pod/Text::Unidecode> or
<http://search.cpan.org/perldoc?Text::Unidecode>

If things look mostly okay, but the Malayalam and/or the Chinese are
just question-marks or empty boxes, it's probably just that your
computer lacks the fonts for those.

TODO
Lots:

* Rebuild the Unihan database. (Talk about hitting a moving target!)

* Add tone-numbers for Mandarin hanzi? Namely: In Unihan, when tone
  marks are present (like in "kMandarin: dào"), should I continue to
  transliterate as just "Dao", or should I put in the tone number:
  "Dao4"? It would be pretty jarring to have digits appear where
  previously there was just alphabetic stuff-- but tone numbers make
  Chinese more readable. (I have a clever idea about doing this, for
  Unidecode v2 or v3.)

* Start dealing with characters over U+FFFF. Cuneiform! Emojis!
  Whatever!

* Fill in all the little characters that have crept into the Misc
  Symbols Etc blocks.

* More things that need tending to are detailed in the TODO.txt file,
  included in this distribution. Normal installs probably don't leave
  the TODO.txt lying around, but if nothing else, you can see it at
  <http://search.cpan.org/search?dist=Text::Unidecode>

MOTTO
The Text::Unidecode motto is:

    It's better than nothing!

...in both meanings: 1) seeing the output of "unidecode(...)" is better
than just having all font-unavailable Unicode characters replaced with
"?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
nothing that Text::Unidecode's algorithm is better than. All sensible
transliteration algorithms (like for German, see below) are going to be
smarter than Unidecode's.

WHEN YOU DON'T LIKE WHAT UNIDECODE DOES
I will repeat the above, because some people miss it:

Text::Unidecode is meant to be a transliterator of last resort, to be
used once you've decided that you can't just display the Unicode data
as is, and once you've decided you don't have a more clever,
language-specific transliterator available-- or once you've already applied a
smarter algorithm and now just want Unidecode to do cleanup.

In other words, when you don't like what Unidecode does, do it
yourself. Really, that's what the above says. Here's how you would do
this for German, for example:

In German, there's the typographical convention that an umlaut (the
double-dots on: ä ö ü) can be written as an "-e", like with "Schön"
becoming "Schoen". But Unidecode doesn't do that-- I have Unidecode
simply drop the umlaut accent and give back "Schon".

(I chose this not because I'm a big meanie, but because generally
changing "ü" to "ue" is disastrous for all text that's not in German.
Finnish "Hyvää päivää" would turn into "Hyvaeae paeivaeae". And I
discourage you from being yet another German who emails me, trying to
impel me to consider a typographical nicety of German to be more
important than all other languages.)

If you know that the text you're handling is probably in German, and
you want to apply the "umlaut becomes -e" rule, here's how to do it for
yourself (and then use Unidecode as the fallback afterwards):

    use utf8;  # <-- probably necessary.

    our( %German_Characters ) = qw(
      Ä AE   ä ae
      Ö OE   ö oe
      Ü UE   ü ue
      ß ss
    );

    use Text::Unidecode qw(unidecode);

    sub german_to_ascii {
      my($german_text) = @_;

      $german_text =~
        s/([ÄäÖöÜüß])/$German_Characters{$1}/g;

      # And now, as a *fallthrough*:
      $german_text = unidecode( $german_text );
      return $german_text;
    }

To pick another example, here's something that's not about a specific
language, but simply having a preference that may or may not agree with
Unidecode's (i.e., mine). Consider the "¥" symbol. Unidecode changes
that to "Y=". If you want "¥" as "YEN", then...

    use utf8;  # so that the literal "¥" and "€" below are characters
    use Text::Unidecode qw(unidecode);

    sub my_favorite_unidecode {
      my($text) = @_;

      $text =~ s/¥/YEN/g;

      # ...and anything else you like, such as:
      $text =~ s/€/Euro/g;

      # And then, as a fallback,...
      $text = unidecode($text);

      return $text;
    }

Then if you do:

    print my_favorite_unidecode("You just won ¥250,000 and €40,000!!!");

...you'll get:

    You just won YEN250,000 and Euro40,000!!!

...just as you like it.

(By the way, the reason I don't have Unidecode just turn "¥" into "YEN"
is that the same symbol also stands for yuan, the Chinese currency. A
"Y=" is nicely, safely neutral as to whether we're talking about yen or
yuan-- Japan, or China.)

Another example: for hanzi/kanji/hanja, I have designed Unidecode to
transliterate according to the value that that character has in
Mandarin (otherwise Cantonese,...). Some users have complained that
applying Unidecode to Japanese produces gibberish.

To make a long story short: transliterating from Japanese is difficult
and it requires a lot of context-sensitivity. If you have text that
you're fairly sure is in Japanese, you're going to have to use a
Japanese-specific algorithm to transliterate Japanese into ASCII. (And
then you can call Unidecode on the output from that-- it is useful for,
for example, turning fullwidth characters into their normal
(ASCII) forms.)
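
The fullwidth cleanup mentioned above might look like this. It is only
a sketch: the input string stands in for whatever your
Japanese-specific step left behind, and $leftover and $cleaned are
names invented for the example.

```perl
use strict;
use warnings;
use utf8;
use Text::Unidecode;

# Fullwidth Latin letters and digits, as often found in Japanese text.
my $leftover = "ＡＢＣ１２３";

# Unidecode as the cleanup pass: fullwidth forms become plain ASCII.
my $cleaned = unidecode($leftover);

print "$cleaned\n";
```

With the current tables this should print "ABC123".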

(Note, as of August 2016: I have titanic but tentative plans for making
the value of Unihan characters be something you could set parameters
for at runtime, in changing the order of "Mandarin else Cantonese
else..." in the value retrieval. Currently that preference list is
hardwired on my end, at module-build time. Other options I'm
considering allowing for: whether the Mandarin and Cantonese values
should have the tone numbers on them; whether every Unihan value should
have a terminal space; and maybe other clever stuff I haven't thought
of yet.)

CAVEATS
If you get really implausible nonsense out of "unidecode(...)", make
sure that the input data really is a utf8 string. See perlunicode and
perlunitut.
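
A common way to end up with such nonsense is to pass raw UTF-8 bytes
instead of a decoded character string. The sketch below uses the core
Encode module to show the difference; the byte string is the UTF-8
encoding of the two characters from the synopsis, and the variable
names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Raw UTF-8 bytes for U+5317 U+4EB0, as read from a file or socket.
my $bytes = "\xE5\x8C\x97\xE4\xBA\xB0";

# Wrong: each byte is treated as a separate character.
my $garbled = unidecode($bytes);

# Right: decode the bytes into characters first.
my $chars = decode('UTF-8', $bytes);
my $good  = unidecode($chars);

print "undecoded: $garbled\n";
print "decoded:   $good\n";
```

Only the decoded version should give the "Bei Jing" shown in the
synopsis.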

Unidecode will work disastrously badly on Japanese. That's because
Japanese is very very hard. To extend the Unidecode motto, Unidecode
is better than nothing, and with Japanese, just barely!

On pure Mandarin, Unidecode will frequently give odd values-- that's
because a single hanzi can have several readings, and Unidecode only
knows what the Unihan database says is the most common one.

THANKS
Thanks to (in only the sloppiest of sorta-chronological order): Jordan
Lachler, Harald Tveit Alvestrand, Melissa Axelrod, Abhijit Menon-Sen,
Mark-Jason Dominus, Joe Johnston, Conrad Heiney, fileformat.info,
Philip Newton, 唐鳳, Tomaž Šolc, Mike Doherty, JT Smith and the
MadMongers, Arden Ogg, Craig Copris, David Cusimano, Brendan Byrd, Hex
Martin, and many other pals who have helped with the ideas or values
for Unidecode's transliterations, or whose help has been in the secret
F5 tornado that constitutes the internals of Unidecode's
implementation.

And thank you to the many people who have encouraged me to plug away at
this project. A decade went by before I had any idea that more than
about 4 or 5 people were using or getting any value out of Unidecode.
I am told that actually my figure was missing some zeroes on the end!

PORTS
Some wonderful people have ported Unidecode to other languages!

· Python: <https://pypi.python.org/pypi/Unidecode>

· PHP: <https://github.com/silverstripe-labs/silverstripe-unidecode>

· Ruby: <http://www.rubydoc.info/gems/unidecode/1.0.0/frames>

· JavaScript: <https://www.npmjs.org/package/unidecode>

· Java: <https://github.com/xuender/unidecode>

I can't vouch for the details of each port, but these are clever
people, so I'm sure they did a fine job.

SEE ALSO
An article I wrote for The Perl Journal about Unidecode:
<http://interglacial.com/tpj/22/> (READ IT!)

Jukka Korpela's <http://www.cs.tut.fi/~jkorpela/fui.html8> which is
brilliantly useful, and its code is brilliant (so, view source!). I
was kinda thinking about maybe doing something sort of like that for
the v2.x versions of Unidecode-- but now he's got me convinced that I
should go right ahead.

Tom Christiansen's Perl Unicode Cookbook,
<http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html>

Unicode Consortium: <http://www.unicode.org/>

Searchable Unihan database:
<http://www.unicode.org/cgi-bin/GetUnihanData.pl>

Geoffrey Sampson. 1990. Writing Systems: A Linguistic Introduction.
ISBN: 0804717567

Randall K. Barry (editor). 1997. ALA-LC Romanization Tables:
Transliteration Schemes for Non-Roman Scripts. ISBN: 0844409405 [ALA
is the American Library Association; LC is the Library of Congress.]

Rupert Snell. 2000. Beginner's Hindi Script (Teach Yourself Books).
ISBN: 0658009109

LICENSE
Copyright (c) 2001, 2014, 2015, 2016 Sean M. Burke.

Unidecode is distributed under the Perl Artistic License (perlartistic),
namely:

This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but
without any warranty; without even the implied warranty of
merchantability or fitness for a particular purpose.

DISCLAIMER
Much of Text::Unidecode's internal data is based on data from The
Unicode Consortium, with which I am unaffiliated. A good deal of the
internal data comes from suggestions that have been contributed by
people other than myself.

The views and conclusions contained in my software and documentation
are my own-- they should not be interpreted as representing official
policies, either expressed or implied, of The Unicode Consortium; nor
should they be interpreted as necessarily the views or conclusions of
people who have contributed to this project.

Moreover, I discourage you from inferring that choices that I've made
in Unidecode reflect political or linguistic prejudices on my part.
Just because Unidecode doesn't do great on your language, or just
because it might seem to do better on some other language, please
don't think I'm out to get you!

AUTHOR
Your pal, Sean M. Burke "sburke@cpan.org"

O HAI!
If you're using Unidecode for anything interesting, be cool and email
me; I'm always curious what people use this for. (The answers so far
have surprised me!)

perl v5.28.0                      2016-11-26                Text::Unidecode(3)