Text::Unidecode(3pm)

Text::Unidecode(3)    User Contributed Perl Documentation   Text::Unidecode(3)

NAME

       Text::Unidecode -- plain ASCII transliterations of Unicode text

SYNOPSIS

         use utf8;
         use Text::Unidecode;
         print unidecode(
           "北亰\n"
           # Chinese characters for Beijing (U+5317 U+4EB0)
         );

         # That prints: Bei Jing

DESCRIPTION

       It often happens that you have non-Roman text data in Unicode, but you
       can't display it-- usually because you're trying to show it to a user
       via an application that doesn't support Unicode, or because the fonts
       you need aren't accessible.  You could represent the Unicode characters
       as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
       user who actually wants to read what the text says.

       What Text::Unidecode provides is a function, unidecode(...), that takes
       Unicode data and tries to represent it in US-ASCII characters (i.e.,
       the universally displayable characters between 0x00 and 0x7F).  The
       representation is almost always an attempt at transliteration-- i.e.,
       conveying, in Roman letters, the pronunciation expressed by the text in
       some other writing system.  (See the example in the synopsis.)

       NOTE:

       To make sure your perldoc/Pod viewing setup is working for this page:
       the six-letter word "résumé" should look like "resume" with an
       "/" accent on each "e".

       For further tests, and help if that doesn't work, see below, "A POD
       ENCODING TEST".

DESIGN PHILOSOPHY

       Unidecode's ability to transliterate from a given language is limited
       by two factors:

       •   The amount and quality of data in the written form of the original
           language

           So if you have Hebrew data that has no vowel points in it, then
           Unidecode cannot guess what vowels should appear in a
           pronunciation.  S f y hv n vwls n th npt, y wn't gt ny vwls n th
           tpt.  (This is a specific application of the general principle of
           "Garbage In, Garbage Out".)

       •   Basic limitations in the Unidecode design

           Writing a real and clever transliteration algorithm for any single
           language usually requires a lot of time, and at least a passable
           knowledge of the language involved.  But Unicode text can convey
           more languages than I could possibly learn (much less create a
           transliterator for) in the entire rest of my lifetime.  So I put a
           cap on how intelligent Unidecode could be, by insisting that it
           support only context-insensitive transliteration.  That means
           missing the finer details of any given writing system, while still
           hopefully being useful.

       Unidecode, in other words, is quick and dirty.  Sometimes the output is
       not so dirty at all: Russian and Greek seem to work passably; and while
       Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing
       system, setting up a mapping from it to Roman letters seems to work
       pretty well.  But sometimes the output is very dirty: Unidecode does
       quite badly on Japanese and Thai.

       If you want a smarter transliteration for a particular language than
       Unidecode provides, then you should look for (or write) a
       transliteration algorithm specific to that language, and apply it
       instead of (or at least before) applying Unidecode.

       In short, Unidecode's approach is broad (knowing about dozens of
       writing systems), but shallow (not being meticulous about any of them).

FUNCTIONS

       Text::Unidecode provides one function, unidecode(...), which is
       exported by default.  It can be used in a variety of calling contexts:

       "$out = unidecode( $in );" # scalar context
           This returns a copy of $in, transliterated.

       "$out = unidecode( @in );" # scalar context
           This is the same as "$out = unidecode(join "", @in);"

       "@out = unidecode( @in );" # list context
           This returns a list consisting of copies of @in, each
           transliterated.  This is the same as "@out = map
           scalar(unidecode($_)), @in;"

       "unidecode( @items );" # void context
       "unidecode( @bar, $foo, @baz );" # void context
           Each item on input is replaced with its transliteration.  This is
           the same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"

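       The calling contexts listed above can be tried side by side.  This is
       only a small sketch: the variable names are made up for illustration,
       and the sample strings reuse characters whose transliterations are
       stated elsewhere on this page ("Schön" and U+5317).

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal UTF-8 text
use Text::Unidecode;           # exports unidecode() by default

# Scalar context, one argument: returns a transliterated copy.
my $in  = "Schön";
my $out = unidecode($in);      # $in itself is left untouched

# Scalar context, several arguments: the pieces are joined into one string.
my @in     = ("Schön", "北");
my $joined = unidecode(@in);   # same as unidecode(join "", @in)

# List context: one transliterated copy per input item.
my @out = unidecode(@in);

# Void context: the arguments themselves are modified in place.
my @items = ("Schön", "北");
unidecode(@items);

print "$out\n";                # Schon
print "[$_]\n" for @out;       # [Schon], then [Bei ] with a trailing space
```

       Note that the void-context form overwrites its arguments, so use one
       of the other forms if you still need the original text afterwards.
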
       You should make a minimum of assumptions about the output of
       unidecode(...).  For example, if you assume an all-alphabetic (Unicode)
       string passed to unidecode(...) will return an all-alphabetic string,
       you're wrong-- some alphabetic Unicode characters are transliterated as
       strings containing punctuation (e.g., the Armenian letter "Թ" (U+0539)
       currently transliterates as "T`": capital-T, then a backtick).

       However, these are the assumptions you can make:

       •   Each character 0x0000 - 0x007F transliterates as itself.  That is,
           unidecode(...) is 7-bit pure.

       •   The output of unidecode(...) always consists entirely of US-ASCII
           characters-- i.e., characters 0x0000 - 0x007F.

       •   All Unicode characters translate to a sequence of (any number of)
           characters that are newline ("\n") or in the range 0x0020-0x007E.
           That is, no Unicode character translates to "\x01", for example.
           (Although if you have a "\x01" on input, you'll get a "\x01" in
           output.)

       •   Yes, some transliterations produce a "\n" but it's just a few, and
           only with good reason.  Note that the value of newline ("\n")
           varies from platform to platform-- see perlport.

       •   Some Unicode characters may transliterate to nothing (i.e., empty
           string).

       •   Very many Unicode characters transliterate to multi-character
           sequences.  E.g., Unihan character U+5317, "北", transliterates as
           the four-character string "Bei ".

       •   Within these constraints, I may change the transliteration of
           characters in future versions.  For example, if someone convinces
           me that the Armenian letter "Թ", currently transliterated as
           "T`", would be better transliterated as "D", I may well make that
           change.

       •   Unfortunately, there are many characters that Unidecode doesn't
           know a transliteration for.  This is generally because the
           character has been added since I last revised the Unidecode data
           tables.  I'm always catching up!

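       The guarantees listed above can be spot-checked in a few lines.  This
       is a sketch, not a test suite: is_printable_ascii() is a hypothetical
       helper written for this example, not part of Text::Unidecode, and the
       sample strings are arbitrary.

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal UTF-8 text
use Text::Unidecode;

# Hypothetical helper (not part of Text::Unidecode): true if the string
# holds only newlines or characters in the range 0x20-0x7E.
sub is_printable_ascii {
  my ($text) = @_;
  return $text =~ /\A[\n\x20-\x7E]*\z/ ? 1 : 0;
}

my $ascii_in = "plain ASCII, tabs\tand all";
my $mixed_in = "Résumé 北";

my $ascii_out = unidecode($ascii_in);   # ASCII input passes through as-is
my $mixed_out = unidecode($mixed_in);   # non-ASCII input becomes ASCII

print "ASCII input unchanged\n"     if $ascii_out eq $ascii_in;
print "output is printable ASCII\n" if is_printable_ascii($mixed_out);
print "[$mixed_out]\n";
```

       The tab in the first string survives only because it was already in
       the input; the 0x20-0x7E guarantee covers what non-ASCII characters
       turn into, not ASCII control characters that pass through unchanged.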

DESIGN GOALS AND CONSTRAINTS

       Text::Unidecode is meant to be a transliterator of last resort, to be
       used once you've decided that you can't just display the Unicode data
       as is, and once you've decided you don't have a more clever, language-
       specific transliterator available, or once you've already applied
       smarter algorithms or mappings that you prefer and you now just want
       Unidecode to do cleanup.

       Unidecode transliterates context-insensitively-- that is, a given
       character is replaced with the same US-ASCII (7-bit ASCII) character or
       characters, no matter what the surrounding characters are.

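       One practical consequence of context-insensitivity is that
       transliterating a string all at once should give the same result as
       transliterating it one character at a time and joining the pieces.
       The following sketch checks that on a sample Russian phrase; the
       variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal UTF-8 text
use Text::Unidecode;

my $text = "КАК ВАС ЗОВУТ";

# Transliterate the whole string in one call...
my $whole = unidecode($text);

# ...and again one character at a time, gluing the results together.
# scalar() forces the single-string return form inside map's list context.
my $piecewise = join "", map { scalar unidecode($_) } split //, $text;

print "same result either way\n" if $whole eq $piecewise;
```
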
       The main reason I'm making Text::Unidecode work with only context-
       insensitive substitution is that it's fast, dumb, and straightforward
       enough to be feasible.  It doesn't tax my (quite limited) knowledge of
       world languages.  It doesn't require me to write a hundred lines of
       code to get the Thai syllabification right (and never knowing whether
       I've gotten it wrong, because I don't know Thai), or to spend a year
       trying to get Text::Unidecode to use the ChaSen algorithm for Japanese,
       or to write heuristics for telling Japanese, Chinese, and Korean apart,
       so it knows how to transliterate any given Uni-Han glyph.  And
       moreover, context-insensitive substitution is still mostly useful, yet
       clearly couldn't be mistaken for authoritative.

       Text::Unidecode is an example of the 80/20 rule in action-- you get 80%
       of the usefulness using just 20% of a "real" solution.

       A "real" approach to transliteration for any given language can involve
       such increasingly tricky contextual factors as these:

       The preceding / following character(s)
           What a given symbol "X" means could depend on whether it's
           followed by a consonant, or by a vowel, or by some diacritic
           character.

       Syllables
           A character "X" at end of a syllable could mean something different
           from when it's at the start-- which is especially problematic when
           the language involved doesn't explicitly mark where one syllable
           stops and the next starts.

       Parts of speech
           What "X" sounds like at the end of a word depends on whether that
           word is a noun, or a verb, or what.

       Meaning
           By semantic context, you can tell that this ideogram "X" means
           "shoe" (pronounced one way) and not "time" (pronounced another),
           and that's how you know to transliterate it one way instead of the
           other.

       Origin of the word
           "X" means one thing in loanwords and/or placenames (and derivatives
           thereof), and another in native words.

       "It's just that way"
           "X" normally makes the /X/ sound, except for this list of seventy
           exceptions (and words based on them, sometimes indirectly).  Or:
           you never can tell which of the three ways to pronounce "X" this
           word actually uses; you just have to know which it is, so keep a
           dictionary on hand!

       Language
           The character "X" is actually used in several different languages,
           and you have to figure out which you're looking at before you can
           determine how to transliterate it.

       Out of a desire to avoid being mired in any of these kinds of
       contextual factors, I chose to exclude all of them and just stick with
       context-insensitive replacement.

A POD ENCODING TEST

       •   "Brontë" is six characters that should look like "Bronte", but with
           double-dots on the "e" character.

       •   "Résumé" is six characters that should look like "Resume", but with
           /-shaped accents on the "e" characters.

       •   "læti" should be four letters long-- the second letter should not
           be two letters "ae", but should be a single letter that looks like
           an "a" entirely fused with an "e".

       •   "χρονος" is six Greek characters that should look kind of like:
           xpovoc

       •   "КАК ВАС ЗОВУТ" is three short Russian words that should look a lot
           like: KAK BAC 3OBYT

       •   "ടധ" is two Malayalam characters that should look like: sw

       •   "丫二十一" is four Chinese characters that should look like: "Y=+-"

       •   "Ｈｅｌｌｏ" is five characters that should look like: Hello

       If all of those come out right, your Pod viewing setup is working
       fine-- welcome to the 2010s!  If those are full of garbage characters,
       consider viewing this page as HTML at
       <https://metacpan.org/pod/Text::Unidecode> or
       <http://search.cpan.org/perldoc?Text::Unidecode>

       If things look mostly okay, but the Malayalam and/or the Chinese are
       just question-marks or empty boxes, it's probably just that your
       computer lacks the fonts for those.

TODO

       Lots:

       * Rebuild the Unihan database.  (Talk about hitting a moving target!)

       * Add tone-numbers for Mandarin hanzi?  Namely: In Unihan, when tone
       marks are present (like in "kMandarin: dào"), should I continue to
       transliterate as just "Dao", or should I put in the tone number:
       "Dao4"?  It would be pretty jarring to have digits appear where
       previously there was just alphabetic stuff-- but tone numbers make
       Chinese more readable.  (I have a clever idea about doing this, for
       Unidecode v2 or v3.)

       * Start dealing with characters over U+FFFF.  Cuneiform! Emojis!
       Whatever!

       * Fill in all the little characters that have crept into the Misc
       Symbols Etc blocks.

       * More things that need tending to are detailed in the TODO.txt file,
       included in this distribution.  Normal installs probably don't leave
       the TODO.txt lying around, but if nothing else, you can see it at
       <http://search.cpan.org/search?dist=Text::Unidecode>

MOTTO

       The Text::Unidecode motto is:

         It's better than nothing!

       ...in both meanings: 1) seeing the output of unidecode(...) is better
       than just having all font-unavailable Unicode characters replaced with
       "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
       nothing that Text::Unidecode's algorithm is better than.  All sensible
       transliteration algorithms (like for German, see below) are going to be
       smarter than Unidecode's.

WHEN YOU DON'T LIKE WHAT UNIDECODE DOES

       I will repeat the above, because some people miss it:

       Text::Unidecode is meant to be a transliterator of last resort, to be
       used once you've decided that you can't just display the Unicode data
       as is, and once you've decided you don't have a more clever, language-
       specific transliterator available-- or once you've already applied a
       smarter algorithm and now just want Unidecode to do cleanup.

       In other words, when you don't like what Unidecode does, do it
       yourself.  Really, that's what the above says.  Here's how you would do
       this for German, for example:

       In German, there's the typographical convention that an umlaut (the
       double-dots on: ä ö ü) can be written as an "-e", like with "Schön"
       becoming "Schoen".  But Unidecode doesn't do that-- I have Unidecode
       simply drop the umlaut accent and give back "Schon".

       (I chose this not because I'm a big meanie, but because generally
       changing "ü" to "ue" is disastrous for all text that's not in German.
       Finnish "Hyvää päivää" would turn into "Hyvaeae paeivaeae".  And I
       discourage you from being yet another German who emails me, trying to
       impel me to consider a typographical nicety of German to be more
       important than all other languages.)

       If you know that the text you're handling is probably in German, and
       you want to apply the "umlaut becomes -e" rule, here's how to do it for
       yourself (and then use Unidecode as the fallback afterwards):

         use utf8;  # needed: this source contains literal non-ASCII characters

         # German-specific replacements, applied before Unidecode runs
         our( %German_Characters ) = qw(
          Ä AE   ä ae
          Ö OE   ö oe
          Ü UE   ü ue
          ß ss
         );

         use Text::Unidecode qw(unidecode);

         sub german_to_ascii {
           my($german_text) = @_;

           $german_text =~
             s/([ÄäÖöÜüß])/$German_Characters{$1}/g;

           # And now, as a *fallthrough*:
           $german_text = unidecode( $german_text );
           return $german_text;
         }

       To pick another example, here's something that's not about a specific
       language, but simply having a preference that may or may not agree with
       Unidecode's (i.e., mine).  Consider the "¥" symbol.  Unidecode changes
       that to "Y=".  If you want "¥" as "YEN", then...

         use Text::Unidecode qw(unidecode);

         sub my_favorite_unidecode {
           my($text) = @_;

           $text =~ s/¥/YEN/g;

           # ...and anything else you like, such as:
           $text =~ s/€/Euro/g;

           # And then, as a fallback,...
           $text = unidecode($text);

           return $text;
         }

       Then if you do:

         print my_favorite_unidecode("You just won ¥250,000 and €40,000!!!");

       ...you'll get:

         You just won YEN250,000 and Euro40,000!!!

       ...just as you like it.

       (By the way, the reason I don't have Unidecode just turn "¥" into "YEN"
       is that the same symbol also stands for yuan, the Chinese currency.  A
       "Y=" is nicely, safely neutral as to whether we're talking about yen or
       yuan-- Japan, or China.)

       Another example: for hanzi/kanji/hanja, I have designed Unidecode to
       transliterate according to the value that that character has in
       Mandarin (otherwise Cantonese,...).  Some users have complained that
       applying Unidecode to Japanese produces gibberish.

       To make a long story short: transliterating from Japanese is difficult
       and it requires a lot of context-sensitivity.  If you have text that
       you're fairly sure is in Japanese, you're going to have to use a
       Japanese-specific algorithm to transliterate Japanese into ASCII.  (And
       then you can call Unidecode on the output from that-- it is useful for,
       for example, turning ｆｕｌｌｗｉｄｔｈ characters into their normal
       (ASCII) forms.)

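       As a small illustration of that cleanup role, the sketch below runs
       Unidecode over a string of fullwidth Latin letters.  The string is
       only a stand-in for whatever a Japanese-specific romanizer might leave
       behind; no such romanizer is shown or assumed here.

```perl
use strict;
use warnings;
use utf8;                      # this source contains literal UTF-8 text
use Text::Unidecode;

# Stand-in for leftover fullwidth letters (U+FF41 and friends) in text
# that has otherwise already been romanized by some other tool.
my $leftover = "ｆｕｌｌｗｉｄｔｈ";

# Unidecode maps each fullwidth letter to its plain ASCII counterpart.
my $clean = unidecode($leftover);

print "$clean\n";              # fullwidth
```
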
       (Note, as of August 2016: I have titanic but tentative plans for making
       the value of Unihan characters be something you could set parameters
       for at runtime, in changing the order of "Mandarin else Cantonese
       else..." in the value retrieval.  Currently that preference list is
       hardwired on my end, at module-build time.  Other options I'm
       considering allowing for: whether the Mandarin and Cantonese values
       should have the tone numbers on them; whether every Unihan value should
       have a terminal space; and maybe other clever stuff I haven't thought
       of yet.)

CAVEATS

       If you get really implausible nonsense out of unidecode(...), make sure
       that the input data really is a utf8 string (that is, decoded
       characters rather than raw bytes).  See perlunicode and perlunitut.

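       A common way to end up with nonsense is to hand unidecode(...) raw
       UTF-8 bytes instead of a decoded character string.  The sketch below
       decodes first, using the core Encode module; the byte string is a
       hand-written stand-in for data read from a file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# The three raw UTF-8 bytes that encode U+5317, as they might arrive
# from a file or socket opened without a decoding layer.
my $bytes = "\xE5\x8C\x97";

# Decode the bytes into a Perl character string before transliterating.
my $chars = decode('UTF-8', $bytes);

print length($bytes), " bytes, ", length($chars), " character\n";
print "[", unidecode($chars), "]\n";   # [Bei ]
```

       Passing $bytes directly would make Unidecode treat each byte as a
       separate character, which is where the implausible output comes from.
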
       Unidecode will work disastrously badly on Japanese.  That's because
       Japanese is very, very hard.  To extend the Unidecode motto, Unidecode
       is better than nothing, and with Japanese, just barely!

       On pure Mandarin, Unidecode will frequently give odd values-- that's
       because a single hanzi can have several readings, and Unidecode only
       knows what the Unihan database says is the most common one.

THANKS

       Thanks to (in only the sloppiest of sorta-chronological order): Jordan
       Lachler, Harald Tveit Alvestrand, Melissa Axelrod, Abhijit Menon-Sen,
       Mark-Jason Dominus, Joe Johnston, Conrad Heiney, fileformat.info,
       Philip Newton, 唐鳳, Tomaž Šolc, Mike Doherty, JT Smith and the
       MadMongers, Arden Ogg, Craig Copris, David Cusimano, Brendan Byrd, Hex
       Martin, and many other pals who have helped with the ideas or values
       for Unidecode's transliterations, or whose help has been in the secret
       F5 tornado that constitutes the internals of Unidecode's
       implementation.

       And thank you to the many people who have encouraged me to plug away at
       this project.  A decade went by before I had any idea that more than
       about 4 or 5 people were using or getting any value out of Unidecode.
       I am told that actually my figure was missing some zeroes on the end!

PORTS

       Some wonderful people have ported Unidecode to other languages!

       •   Python: <https://pypi.python.org/pypi/Unidecode>

       •   PHP: <https://github.com/silverstripe-labs/silverstripe-unidecode>

       •   Ruby: <http://www.rubydoc.info/gems/unidecode/1.0.0/frames>

       •   JavaScript: <https://www.npmjs.org/package/unidecode>

       •   Java: <https://github.com/xuender/unidecode>

       I can't vouch for the details of each port, but these are clever
       people, so I'm sure they did a fine job.

LICENSE

       Copyright (c) 2001, 2014, 2015, 2016 Sean M. Burke.

       Unidecode is distributed under the Perl Artistic License
       (perlartistic), namely:

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of
       merchantability or fitness for a particular purpose.

DISCLAIMER

       Much of Text::Unidecode's internal data is based on data from The
       Unicode Consortium, with which I am unaffiliated.  A good deal of the
       internal data comes from suggestions that have been contributed by
       people other than myself.

       The views and conclusions contained in my software and documentation
       are my own-- they should not be interpreted as representing official
       policies, either expressed or implied, of The Unicode Consortium; nor
       should they be interpreted as necessarily the views or conclusions of
       people who have contributed to this project.

       Moreover, I discourage you from inferring that choices that I've made
       in Unidecode reflect political or linguistic prejudices on my part.
       Just because Unidecode doesn't do great on your language, or just
       because it might seem to do better on some other language, please
       don't think I'm out to get you!

AUTHOR

       Your pal, Sean M. Burke "sburke@cpan.org"

O HAI!

       If you're using Unidecode for anything interesting, be cool and email
       me; I'm always curious what people use this for.  (The answers so far
       have surprised me!)

perl v5.36.0                      2023-01-20                Text::Unidecode(3)