Locale::Maketext::TPJ13(3pm)  Perl Programmers Reference Guide  Locale::Maketext::TPJ13(3pm)

NAME

       Locale::Maketext::TPJ13 -- article about software localization

SYNOPSIS

         # This is an article, not a module.

DESCRIPTION

       The following article by Sean M. Burke and Jordan Lachler first
       appeared in The Perl Journal #13 and is copyright 1999 The Perl
       Journal. It appears courtesy of Jon Orwant and The Perl Journal.  This
       document may be distributed under the same terms as Perl itself.

Localization and Perl: gettext breaks, Maketext fixes

       by Sean M. Burke and Jordan Lachler

       This article points out cases where gettext (a common system for
       localizing software interfaces -- i.e., making them work in the user's
       language of choice) fails because of basic differences between human
       languages.  This article then describes Maketext, a new system capable
       of correctly treating these differences.

   A Localization Horror Story: It Could Happen To You
           "There are a number of languages spoken by human beings in this
           world."

           -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
           Identification of Languages"

       Imagine that your task for the day is to localize a piece of software
       -- and luckily for you, the only output the program emits is two
       messages, like this:

         I scanned 12 directories.

         Your query matched 10 files in 4 directories.

       So how hard could that be?  You look at the code that produces the
       first item, and it reads:

         printf("I scanned %g directories.",
                $directory_count);

       You think about that, and realize that it doesn't even work right for
       English, as it can produce this output:

         I scanned 1 directories.

       So you rewrite it to read:

         printf("I scanned %g %s.",
                $directory_count,
                $directory_count == 1 ?
                  "directory" : "directories",
         );

       ...which does the Right Thing.  (In case you don't recall, "%g" is for
       locale-specific number interpolation, and "%s" is for string
       interpolation.)

       But you still have to localize it for all the languages you're
       producing this software for, so you pull Locale::gettext off of CPAN so
       you can access the "gettext" C functions you've heard are standard for
       localization tasks.

       And you write:

         printf(gettext("I scanned %g %s."),
                $dir_scan_count,
                $dir_scan_count == 1 ?
                  gettext("directory") : gettext("directories"),
         );

       But you then read in the gettext manual (Drepper, Miller, and Pinard
       1995) that this is not a good idea, since how a single word like
       "directory" or "directories" is translated may depend on context -- and
       this is true, since in a case language like German or Russian, you
       may need these words with a different case ending in the first instance
       (where the word is the object of a verb) than in the second instance,
       which you haven't even gotten to yet (where the word is the object of a
       preposition, "in %g directories") -- assuming these keep the same
       syntax when translated into those languages.

       So, on the advice of the gettext manual, you rewrite:

         printf( $dir_scan_count == 1 ?
                  gettext("I scanned %g directory.") :
                  gettext("I scanned %g directories."),
                $dir_scan_count );

       So, you email your various translators (the boss decides that the
       languages du jour are Chinese, Arabic, Russian, and Italian, so you
       have one translator for each), asking for translations for "I scanned
       %g directory." and "I scanned %g directories.".  When they reply,
       you'll put those in the lexicons for gettext to use when it localizes
       your software, so that when the user is running under the "zh"
       (Chinese) locale, gettext("I scanned %g directory.") will return the
       appropriate Chinese text, with a "%g" in there where printf can then
       interpolate $dir_scan_count.

       Your Chinese translator emails right back -- he says both of these
       phrases translate to the same thing in Chinese, because, in linguistic
       jargon, Chinese "doesn't have number as a grammatical category" --
       whereas English does.  That is, English has grammatical rules that
       refer to "number", i.e., whether something is grammatically singular or
       plural; and one of these rules is the one that forces nouns to take a
       plural suffix (generally "s") when in a plural context, as they are
       when they follow a number other than "one" (including, oddly enough,
       "zero").  Chinese has no such rules, and so has just the one phrase
       where English has two.  But, no problem, you can have this one Chinese
       phrase appear as the translation for the two English phrases in the
       "zh" gettext lexicon for your program.

       Emboldened by this, you dive into the second phrase that your software
       needs to output: "Your query matched 10 files in 4 directories.".  You
       notice that if you want to treat phrases as indivisible, as the gettext
       manual wisely advises, you need four cases now, instead of two, to
       cover the permutations of singular and plural on the two items,
       $directory_count and $file_count.  So you try this:

         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            gettext("Your query matched %g file in %g directory.") :
            gettext("Your query matched %g file in %g directories.") ) :
           ( $directory_count == 1 ?
            gettext("Your query matched %g files in %g directory.") :
            gettext("Your query matched %g files in %g directories.") ),
          $file_count, $directory_count,
         );

       (The case of "1 file in 2 [or more] directories" could, I suppose,
       occur in the case of symlinking or something of the sort.)

       It occurs to you that this is not the prettiest code you've ever
       written, but this seems the way to go.  You mail off to the translators
       asking for translations for these four cases.  The Chinese guy replies
       with the one phrase that these all translate to in Chinese, and that
       phrase has two "%g"s in it, as it should -- but there's a problem.  He
       translates it word-for-word back: "In %g directories contains %g files
       match your query."  The %g slots are in an order reverse to what they
       are in English.  You wonder how you'll get gettext to handle that.

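       As an aside that is not part of the original article: Perl's own
       sprintf accepts explicit argument indexes such as "%2$g", which is
       one way to cope with slots that change order.  The sketch below
       assumes a hypothetical %format table and a hypothetical
       matched_message function, with English words standing in for the
       translated text.  Note that it only reorders the slots; choosing
       among singular and plural phrases is still left to the caller.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for translated format strings; real entries
# would come back from the translators. Single quotes keep Perl from
# interpolating the "$g" in the explicit-index conversions.
my %format = (
    en => 'Your query matched %1$g files in %2$g directories.',
    zh => 'In %2$g directories contains %1$g files match your query.',
);

# Fill the same argument list into whichever format the locale supplies.
sub matched_message {
    my ($locale, $file_count, $directory_count) = @_;
    return sprintf $format{$locale}, $file_count, $directory_count;
}

print matched_message('en', 10, 4), "\n";
print matched_message('zh', 10, 4), "\n";
```
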
       But you put it aside for the moment, and optimistically hope that the
       other translators won't have this problem, and that their languages
       will be better behaved -- i.e., that they will be just like English.

       But the Arabic translator is the next to write back.  First off, your
       code for "I scanned %g directory." or "I scanned %g directories."
       assumes there's only singular or plural.  But, to use linguistic jargon
       again, Arabic has grammatical number, like English (but unlike
       Chinese), but it's a three-term category: singular, dual, and plural.
       In other words, the way you say "directory" depends on whether there's
       one directory, or two of them, or more than two of them.  Your test of
       "($directory_count == 1)" no longer does the job.  And it means that
       where English's grammatical category of number necessitates only the
       two permutations of the first sentence based on "directory [singular]"
       and "directories [plural]", Arabic has three -- and, worse, in the
       second sentence ("Your query matched %g file in %g directory."), where
       English has four, Arabic has nine.  You sense an unwelcome, exponential
       trend taking shape.

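       The arithmetic behind that trend can be sketched as follows.  The
       %categories_for table and the phrase_count function are
       hypothetical names, and the category lists are the simplified ones
       given in the text, not a full account of any language's grammar.

```perl
use strict;
use warnings;

# Simplified number categories per language, as described in the text.
my %categories_for = (
    zh => [ 'any' ],
    en => [ 'singular', 'plural' ],
    ar => [ 'singular', 'dual', 'plural' ],
);

# Number of distinct whole phrases needed when a sentence has $slots
# independent numeric slots: (number of categories) ** (number of slots).
sub phrase_count {
    my ($locale, $slots) = @_;
    return scalar(@{ $categories_for{$locale} }) ** $slots;
}

for my $locale (qw(zh en ar)) {
    printf "%s: %d phrase(s) for one slot, %d for two\n",
        $locale, phrase_count($locale, 1), phrase_count($locale, 2);
}
```

       A third numeric slot would take Arabic to 27 phrases under the
       same assumptions.
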
       Your Italian translator emails you back and says that "I scanned 0
       directories" (a possible English output of your program) is stilted,
       and if you think that's fine English, that's your problem, but that
       just will not do in the language of Dante.  He insists that where
       $directory_count is 0, your program should produce the Italian text for
       "I didn't scan any directories.".  And ditto for "I didn't match any
       files in any directories", although he says the last part about "in any
       directories" should probably just be left off.

       You wonder how you'll get gettext to handle this; to accommodate the
       ways Arabic, Chinese, and Italian deal with numbers in just these few
       very simple phrases, you need to write code that will ask gettext for
       different queries depending on whether the numerical values in question
       are 1, 2, more than 2, or in some cases 0, and you still haven't
       figured out the problem with the different word order in Chinese.

       Then your Russian translator calls on the phone, to personally tell you
       the bad news about how really unpleasant your life is about to become:

       Russian, like German or Latin, is an inflectional language; that is,
       nouns and adjectives have to take endings that depend on their case
       (i.e., nominative, accusative, genitive, etc.) -- which is roughly a
       matter of what role they have in the syntax of the sentence -- as well
       as on the grammatical gender (i.e., masculine, feminine, neuter) and
       number (i.e., singular or plural) of the noun, as well as on the
       declension class of the noun.  But unlike with most other inflected
       languages, putting a number-phrase (like "ten" or "forty-three", or
       their Arabic numeral equivalents) in front of a noun in Russian can
       change the case and number that noun is, and therefore the endings you
       have to put on it.

       He elaborates:  In "I scanned %g directories", you'd expect
       "directories" to be in the accusative case (since it is the direct
       object in the sentence) and the plural number, except where
       $directory_count is 1, then you'd expect the singular, of course.  Just
       like Latin or German.  But!  Where $directory_count % 10 is 1 ("%" for
       modulo, remember), assuming $directory_count is an integer, and except
       where $directory_count % 100 is 11, "directories" is forced to become
       grammatically singular, which means it gets the ending for the
       accusative singular...  You begin to visualize the code it'd take to
       test for the problem so far, and still work for Chinese and Arabic and
       Italian, and how many gettext items that'd take, but he keeps going...
       But where $directory_count % 10 is 2, 3, or 4 (except where
       $directory_count % 100 is 12, 13, or 14), the word for "directories" is
       forced to be genitive singular -- which means another ending... The
       room begins to spin around you, slowly at first...  But with all other
       integer values, since "directory" is an inanimate noun, when preceded
       by a number and in the nominative or accusative cases (as it is here,
       just your luck!), it does stay plural, but it is forced into the
       genitive case -- yet another ending...  And you never hear him get to
       the part about how you're going to run into similar (but maybe subtly
       different) problems with other Slavic languages like Polish, because
       the floor comes up to meet you, and you fade into unconsciousness.

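       For the record, the rules the Russian translator recites can be
       written down compactly.  This is only a sketch of the three cases
       as the text states them; the function name
       russian_number_category and the category labels are hypothetical,
       and it assumes a non-negative integer count.

```perl
use strict;
use warnings;

# Returns a label for the noun form that a count calls for, following
# the three cases described in the text.
sub russian_number_category {
    my ($n) = @_;              # assumed to be a non-negative integer
    my $mod10  = $n % 10;
    my $mod100 = $n % 100;

    # Ends in 1, but not in 11: grammatically singular.
    return 'singular' if $mod10 == 1 && $mod100 != 11;

    # Ends in 2, 3, or 4, but not in 12, 13, or 14: genitive singular.
    return 'genitive_singular'
        if $mod10 >= 2 && $mod10 <= 4
        && !($mod100 >= 12 && $mod100 <= 14);

    # Everything else: genitive plural.
    return 'genitive_plural';
}

for my $n (1, 2, 5, 11, 12, 21, 22, 25, 101, 111) {
    printf "%3d => %s\n", $n, russian_number_category($n);
}
```
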
       The above cautionary tale relates how an attempt at localization can
       lead from programmer consternation, to program obfuscation, to a need
       for sedation.  But careful evaluation shows that your choice of tools
       merely needed further consideration.

   The Linguistic View
           "It is more complicated than you think."

           -- The Eighth Networking Truth, from RFC 1925

       The field of Linguistics has expended a great deal of effort over the
       past century trying to find grammatical patterns which hold across
       languages; it's been a constant process of people making
       generalizations that should apply to all languages, only to find out
       that, all too often, these generalizations fail -- sometimes failing
       for just a few languages, sometimes whole classes of languages, and
       sometimes nearly every language in the world except English.  Broad
       statistical trends are evident in what the "average language" is like
       as far as what its rules can look like, must look like, and cannot look
       like.  But the "average language" is just as unreal a concept as the
       "average person" -- it runs up against the fact that no language (or
       person) is, in fact, average.  The wisdom of past experience leads us
       to believe that any given language can do whatever it wants, in any
       order, with appeal to any kind of grammatical categories it wants --
       case, number, tense, real or metaphoric characteristics of the things
       that words refer to, arbitrary or predictable classifications of words
       based on what endings or prefixes they can take, degree or means of
       certainty about the truth of statements expressed, and so on, ad
       infinitum.

       Mercifully, most localization tasks are a matter of finding ways to
       translate whole phrases, generally sentences, where the context is
       relatively set, and where the only variation in content is usually in a
       number being expressed -- as in the example sentences above.
       Translating specific, fully-formed sentences is, in practice, fairly
       foolproof -- which is good, because that's what's in the phrasebooks
       that so many tourists rely on.  Now, a given phrase (whether in a
       phrasebook or in a gettext lexicon) in one language might have a
       greater or lesser applicability than that phrase's translation into
       another language -- for example, strictly speaking, in Arabic, the
       "your" in "Your query matched..." would take a different form depending
       on whether the user is male or female; so the Arabic translation
       "your[feminine] query" is applicable in fewer cases than the
       corresponding English phrase, which doesn't distinguish the user's
       gender.  (In practice, it's not feasible to have a program know the
       user's gender, so the masculine "you" in Arabic is usually used, by
       default.)

       But in general, such surprises are rare when entire sentences are being
       translated, especially when the functional context is restricted to
       that of a computer interacting with a user either to convey a fact or
       to prompt for a piece of information.  So, for purposes of
       localization, translation by phrase (generally by sentence) is both the
       simplest and the least problematic.

   Breaking gettext
           "It Has To Work."

           -- First Networking Truth, RFC 1925

       Consider that sentences in a tourist phrasebook are of two types: ones
       like "How do I get to the marketplace?" that don't have any blanks to
       fill in, and ones like "How much do these ___ cost?", where there's one
       or more blanks to fill in (and these are usually linked to a list of
       words that you can put in that blank: "fish", "potatoes", "tomatoes",
       etc.)  The ones with no blanks are no problem, but the
       fill-in-the-blank ones may not be really straightforward. If it's a
       Swahili phrasebook, for example, the authors probably didn't bother to
       tell you the complicated ways that the verb "cost" changes its
       inflectional prefix depending on the noun you're putting in the blank.
       The trader in the marketplace will still understand what you're saying
       if you say "how much do these potatoes cost?" with the wrong
       inflectional prefix on "cost".  After all, you can't speak proper
       Swahili, you're just a tourist.  But while tourists can be stupid,
       computers are supposed to be smart; the computer should be able to fill
       in the blank, and still have the results be grammatical.

       In other words, a phrasebook entry takes some values as parameters (the
       things that you fill in the blank or blanks), and provides a value
       based on these parameters, where the way you get that final value from
       the given values can, properly speaking, involve an arbitrarily complex
       series of operations.  (In the case of Chinese, it'd be not at all
       complex, at least in cases like the examples at the beginning of this
       article; whereas in the case of Russian it'd be a rather complex series
       of operations.  And in some languages, the complexity could be spread
       around differently: while the act of putting a number-expression in
       front of a noun phrase might not be complex by itself, it may change
       how you have to, for example, inflect a verb elsewhere in the sentence.
       This is what in syntax is called "long-distance dependencies".)

       This talk of parameters and arbitrary complexity is just another way to
       say that an entry in a phrasebook is what in a programming language
       would be called a "function".  Just so you don't miss it, this is the
       crux of this article: A phrase is a function; a phrasebook is a bunch
       of functions.

       The reason that using gettext runs into walls (as in the above
       second-person horror story) is that you're trying to use a string (or
       worse, a choice among a bunch of strings) to do what you really need a
       function for -- which is futile.  Performing (s)printf interpolation on
       the strings which you get back from gettext does allow you to do some
       common things passably well... sometimes... sort of; but, to paraphrase
       what some people say about "csh" script programming, "it fools you into
       thinking you can use it for real things, but you can't, and you don't
       discover this until you've already spent too much time trying, and by
       then it's too late."

   Replacing gettext
       So, what needs to replace gettext is a system that supports lexicons of
       functions instead of lexicons of strings.  An entry in a lexicon from
       such a system should not look like this:

         "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

       [\xE9 is e-acute in Latin-1.  Some pod renderers would scream if I used
       the actual character here. -- SB]

       but instead like this, bearing in mind that this is just a first stab:

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       Now, there's no particularly obvious way to store anything but strings
       in a gettext lexicon; so it looks like we just have to start over and
       make something better, from scratch.  I call my shot at a
       gettext-replacement system "Maketext", or, in CPAN terms,
       Locale::Maketext.

       When designing Maketext, I chose to plan its main features in terms of
       "buzzword compliance".  And here are the buzzwords:

   Buzzwords: Abstraction and Encapsulation
       The complexity of the language you're trying to output a phrase in is
       entirely abstracted inside (and encapsulated within) the Maketext
       module for that interface.  When you call:

         print $lang->maketext("You have [quant,_1,piece] of new mail.",
                              scalar(@messages));

       you don't know (and in fact can't easily find out) whether this will
       involve lots of figuring, as in Russian (if $lang is a handle to the
       Russian module), or relatively little, as in Chinese.  That kind of
       abstraction and encapsulation may encourage other pleasant buzzwords
       like modularization and stratification, depending on what design
       decisions you make.

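       To make that call concrete, here is a minimal sketch of how such a
       handle might be set up with Locale::Maketext as it is distributed
       today.  The package names MyApp::L10N and MyApp::L10N::en are
       hypothetical, and the "_AUTO" lexicon setting is used only so that
       the English key can serve as its own entry.

```perl
use strict;
use warnings;

# Base class for this hypothetical project's localization modules.
package MyApp::L10N;
use parent 'Locale::Maketext';

# English module. _AUTO => 1 lets any key that is not listed in the
# lexicon act as its own bracket-notation entry.
package MyApp::L10N::en;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = ( _AUTO => 1 );

package main;

# Ask the base class for a handle to the best available language module.
my $lang = MyApp::L10N->get_handle('en')
    or die "No language handle available";

my @messages = ('first', 'second', 'third');
print $lang->maketext("You have [quant,_1,piece] of new mail.",
                      scalar(@messages)), "\n";
```

       The calling code never mentions plural rules; swapping in a handle
       for another language module would change the output without
       touching this call.
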
   Buzzword: Isomorphism
       "Isomorphism" means "having the same structure or form"; in discussions
       of program design, the word takes on the special, specific meaning that
       your implementation of a solution to a problem has the same structure
       as, say, an informal verbal description of the solution, or maybe of
       the problem itself.  Isomorphism is, all things considered, a good
       thing -- it's what problem-solving (and solution-implementing) should
       look like.

       What's wrong with gettext-using code like this...

         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            gettext("Your query matched %g file in %g directory.") :
            gettext("Your query matched %g file in %g directories.") ) :
           ( $directory_count == 1 ?
            gettext("Your query matched %g files in %g directory.") :
            gettext("Your query matched %g files in %g directories.") ),
          $file_count, $directory_count,
         );

       is first off that it's not well abstracted -- these ways of testing for
       grammatical number (as in the expressions like "foo == 1 ?
       singular_form : plural_form") should be abstracted to each language
       module, since how you get grammatical number is language-specific.

       But second off, it's not isomorphic -- the "solution" (i.e., the
       phrasebook entries) for Chinese maps from these four English phrases to
       the one Chinese phrase that fits for all of them.  In other words, the
       informal solution would be "The way to say what you want in Chinese is
       with the one phrase 'For your question, in Y directories you would find
       X files'" -- and so the implemented solution should be, isomorphically,
       just a straightforward way to spit out that one phrase, with numerals
       properly interpolated.  It shouldn't have to map from the complexity of
       other languages to the simplicity of this one.

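       As a sketch of what the isomorphic version could look like with
       Locale::Maketext's bracket notation: the Chinese module below holds
       exactly one phrase, with its parameters in the order that language
       wants.  The package names are hypothetical, and English words stand
       in for the Chinese text.

```perl
use strict;
use warnings;

package MyApp::L10N;
use parent 'Locale::Maketext';

# English: the key doubles as the entry, with plural logic in quant.
package MyApp::L10N::en;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = ( _AUTO => 1 );

# Stand-in for a Chinese module: a single phrase, no plural logic,
# and the two parameters appear in reverse order.
package MyApp::L10N::zh;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = (
    'Your query matched [quant,_1,file] in [quant,_2,directory,directories].'
        => 'For your question, in [_2] directory you would find [_1] file.',
);

package main;

my $key =
    'Your query matched [quant,_1,file] in [quant,_2,directory,directories].';
for my $tag (qw(en zh)) {
    my $lang = MyApp::L10N->get_handle($tag) or die "No handle for $tag";
    print $lang->maketext($key, 10, 4), "\n";
}
```

       The English key spells out both forms of "directory" because the
       stock quant method, given only one form, simply appends an "s".
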
   Buzzword: Inheritance
       There's a great deal of reuse possible for sharing of phrases between
       modules for related dialects, or for sharing of auxiliary functions
       between related languages.  (By "auxiliary functions", I mean functions
       that don't produce phrase-text, but which, say, return an answer to
       "does this number require a plural noun after it?".  Such auxiliary
       functions would be used in the internal logic of functions that
       actually do produce phrase-text.)

       In the case of sharing phrases, consider that you have an interface
       already localized for American English (probably by having been written
       with that as the native locale, but that's incidental).  Localizing it
       for UK English should, in practical terms, be just a matter of running
       it past a British person with the instructions to indicate what few
       phrases would benefit from a change in spelling or possibly minor
       rewording.  In that case, you should be able to put in the UK English
       localization module only those phrases that are UK-specific, and for
       all the rest, inherit from the American English module.  (And I expect
       this same situation would apply with Brazilian and Continental
       Portuguese, possibly with some very closely related languages like
       Czech and Slovak, and possibly with the slightly different "versions"
       of written Mandarin Chinese, as I hear exist in Taiwan and mainland
       China.)

       As to sharing of auxiliary functions, consider the problem of Russian
       numbers from the beginning of this article; obviously, you'd want to
       write only once the hairy code that, given a numeric value, would
       return some specification of which case and number a given quantified
       noun should use.  But suppose that you discover, while localizing an
       interface for, say, Ukrainian (a Slavic language related to Russian,
       spoken by tens of millions of people, many of whom would be relieved to
       find that your Web site's or software's interface is available in their
       language), that the rules in Ukrainian are the same as in Russian for
       quantification, and probably for many other grammatical functions.
       While there may well be no phrases in common between Russian and
       Ukrainian, you could still choose to have the Ukrainian module inherit
       from the Russian module, just for the sake of inheriting all the
       various grammatical methods.  Or, probably better organizationally, you
       could move those functions to a module called "_E_Slavic" or something,
       which Russian and Ukrainian could inherit useful functions from, but
       which would (presumably) provide no lexicon.

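       The phrase-sharing half of this can be sketched with
       Locale::Maketext, which consults the lexicons of a language
       module's parent classes when a key is missing from its own.  The
       package names and the lexicon keys below are hypothetical.

```perl
use strict;
use warnings;

package MyApp::L10N;
use parent 'Locale::Maketext';

# American English: the full lexicon.
package MyApp::L10N::en_us;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = (
    'pick_color' => 'Pick a color.',
    'save_file'  => 'Save the file.',
);

# UK English: only the phrases that differ; the rest are inherited.
package MyApp::L10N::en_gb;
use parent -norequire, 'MyApp::L10N::en_us';
our %Lexicon = (
    'pick_color' => 'Pick a colour.',
);

package main;

my $uk = MyApp::L10N->get_handle('en-gb') or die "No handle for en-gb";
print $uk->maketext('pick_color'), "\n";   # found in the en_gb lexicon
print $uk->maketext('save_file'), "\n";    # inherited from en_us
```

       Sharing grammatical helper methods between, say, Russian and
       Ukrainian modules would use the same parent-class mechanism, with
       ordinary methods in place of lexicon entries.
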
   Buzzword: Concision
       Okay, concision isn't a buzzword.  But it should be, so I decree that
       as a new buzzword, "concision" means that simple common things should
       be expressible in very few lines (or maybe even just a few characters)
       of code -- call it a special case of "making simple things easy and
       hard things possible", and see also the role it played in the
       MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].

       Consider our first stab at an entry in our "phrasebook of functions":

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       You may sense that a lexicon (to use a non-committal catch-all term for
       a collection of things you know how to say, regardless of whether
       they're phrases or words) consisting of functions expressed as above
       would make for rather long-winded and repetitive code -- even if you
       wisely rewrote this to have quantification (as we call adding a number
       expression to a noun phrase) be a function called like:

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = quant($files, "fichier");
           $dirs =  quant($dirs,  "r\xE9pertoire");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       And you may also sense that you do not want to bother your translators
       with having to write Perl code -- you'd much rather that they spend
       their very costly time on just translation.  And this is to say nothing
       of the near impossibility of finding a commercial translator who would
       know even simple Perl.

       In a first-hack implementation of Maketext, each language-module's
       lexicon looked like this:

        %Lexicon = (
          "I found %g files in %g directories"
          => sub {
             my( $files, $dirs ) = @_[0,1];
             $files = quant($files, "fichier");
             $dirs =  quant($dirs,  "r\xE9pertoire");
             return "J'ai trouv\xE9 $files dans $dirs.";
           },
         ... and so on with other phrase => sub mappings ...
        );

       but I immediately went looking for some more concise way to basically
       denote the same phrase-function -- a way that would also serve to
       concisely denote most phrase-functions in the lexicon for most
       languages.  After much time and even some actual thought, I decided on
       this system:

       * Where a value in a %Lexicon hash is a contentful string instead of an
       anonymous sub (or, conceivably, a coderef), it would be interpreted as
       a sort of shorthand expression of what the sub does.  When accessed for
       the first time in a session, it is parsed, turned into Perl code, and
       then eval'd into an anonymous sub; then that sub replaces the original
       string in that lexicon.  (That way, the work of parsing and evaling the
       shorthand form for a given phrase is done no more than once per
       session.)

       * Calls to "maketext" (as Maketext's main function is called) happen
       thru a "language session handle", notionally very much like an IO
       handle, in that you open one at the start of the session, and use it
       for "sending signals" to an object in order to have it return the text
       you want.

       So, this:

         $lang->maketext("You have [quant,_1,piece] of new mail.",
                        scalar(@messages));

       basically means this: look in the lexicon for $lang (which may inherit
       from any number of other lexicons), and find the function that we
       happen to associate with the string "You have [quant,_1,piece] of new
       mail" (which is, and should be, a functioning "shorthand" for this
       function in the native locale -- English in this case).  If you find
       such a function, call it with $lang as its first parameter (as if it
       were a method), and then a copy of scalar(@messages) as its second, and
       then return that value.  If that function was found, but was in string
       shorthand instead of being a fully specified function, parse it and
       make it into a function before calling it the first time.

       * The shorthand uses code in brackets to indicate method calls that
       should be performed.  A full explanation is not in order here, but a
       few examples will suffice:

         "You have [quant,_1,piece] of new mail."

       The above code is shorthand for, and will be interpreted as, this:

         sub {
           my $handle = $_[0];
           my(@params) = @_;
           return join '',
             "You have ",
             $handle->quant($params[1], 'piece'),
             " of new mail.";
         }

       where "quant" is the name of a method you're using to quantify the noun
       "piece" with the number $params[1].

       A string with no brackety calls, like this:

         "Your search expression was malformed."

       is somewhat of a degenerate case, and just gets turned into:

         sub { return "Your search expression was malformed." }

565       However, not everything you can write in Perl code can be written in
566       the above shorthand system -- not by a long shot.  For example,
567       consider the Italian translator from the beginning of this article, who
568       wanted the Italian for "I didn't find any files" as a special case,
569       instead of "I found 0 files".  That couldn't be specified (at least not
570       easily or simply) in our shorthand system, and it would have to be
571       written out in full, like this:
572
573         sub {  # pretend the English strings are in Italian
574           my($handle, $files, $dirs) = @_[0,1,2];
575           return "I didn't find any files" unless $files;
576           return join '',
577             "I found ",
578             $handle->quant($files, 'file'),
579             " in ",
580             $handle->quant($dirs,  'directory'),
581             ".";
582         }
583
584       Next to a lexicon full of shorthand code, that sort of sticks out like
585       a sore thumb -- but this is a special case, after all; and at least
586       it's possible, if not as concise as usual.
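
          In a lexicon, such a hand-written function can sit right next to
          ordinary shorthand entries.  A sketch, assuming Locale::Maketext and
          the hypothetical classes BogoQuery::Locale and BogoQuery::Locale::it;
          both noun forms are passed to quant here so that the stock method
          can pluralize "directory":

```perl
use strict;
use warnings;

package BogoQuery::Locale;          # hypothetical project base class
use parent 'Locale::Maketext';

package BogoQuery::Locale::it;      # hypothetical Italian lexicon
use parent -norequire, 'BogoQuery::Locale';
our %Lexicon = (
    # An ordinary entry, written in bracket shorthand.
    'I scanned [quant,_1,directory,directories].'
        => 'I scanned [quant,_1,directory,directories].',

    # The special case, written out in full as a code reference
    # (pretend the English strings are in Italian).
    'Your query matched [quant,_1,file] in [quant,_2,directory].' => sub {
        my ($handle, $files, $dirs) = @_;
        return "I didn't find any files" unless $files;
        return join '',
            'I found ',
            $handle->quant($files, 'file', 'files'),
            ' in ',
            $handle->quant($dirs, 'directory', 'directories'),
            '.';
    },
);

package main;

my $lh = BogoQuery::Locale->get_handle('it')
    or die "No language handle available";
my $key = 'Your query matched [quant,_1,file] in [quant,_2,directory].';

print $lh->maketext($key, 0, 0), "\n";
print $lh->maketext($key, 2, 1), "\n";
```

          The caller cannot tell which kind of entry it hit; maketext is
          called the same way for both.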
587
588       As to how you'd implement the Russian example from the beginning of the
589       article, well, There's More Than One Way To Do It, but it could be
590       something like this (using English words for Russian, just so you know
591       what's going on):
592
593         "I [quant,_1,directory,accusative] scanned."
594
595       This shifts the burden of complexity off to the quant method.  That
596       method's parameters are: the numeric value it's going to use to
597       quantify something; the Russian word it's going to quantify; and the
598       parameter "accusative", which you're using to mean that this sentence's
599       syntax wants a noun in the accusative case there, although that
600       quantification method may have to overrule, for grammatical reasons you
601       may recall from the beginning of this article.
602
603       Now, the Russian quant method here is responsible not only for
604       implementing the strange logic necessary for figuring out how Russian
605       number-phrases impose case and number on their noun-phrases, but also
606       for inflecting the Russian word for "directory".  How that inflection
607       is to be carried out is no small issue, and among the solutions I've
608       seen, some (like variations on a simple lookup in a hash where all
609       possible forms are provided for all necessary words) are
610       straightforward but can become cumbersome when you need to inflect more
611       than a few dozen words; and other solutions (like using algorithms to
612       model the inflections, storing only root forms and irregularities) can
613       involve more overhead than is justifiable for all but the largest
614       lexicons.
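
          The hash-lookup approach might be sketched as follows.  Everything
          here is an assumption made for illustration: quant_ru and %FORMS are
          hypothetical names, labelled English stands in for the Russian
          forms, and the number rule is a simplification that covers only an
          inanimate noun in a nominative or accusative slot:

```perl
use strict;
use warnings;

# Hypothetical table of inflected forms, keyed by headword.  Every form
# that quant_ru might need has to be listed for every word.
my %FORMS = (
    directory => {
        accusative_singular => 'directory(ACC.SG)',
        genitive_singular   => 'directory(GEN.SG)',
        genitive_plural     => 'directory(GEN.PL)',
    },
);

# Simplified Russian-style quant: the number, and not only the sentence,
# decides which form of the noun is needed.
sub quant_ru {
    my ($num, $word, $case) = @_;
    my $last_one = $num % 10;
    my $last_two = $num % 100;
    my $form;
    if ($last_one == 1 && $last_two != 11) {
        $form = "${case}_singular";     # the sentence's case is used
    }
    elsif ($last_one >= 2 && $last_one <= 4
           && !($last_two >= 12 && $last_two <= 14)) {
        $form = 'genitive_singular';    # the number overrules the case
    }
    else {
        $form = 'genitive_plural';      # the number overrules the case
    }
    my $inflected = $FORMS{$word}{$form}
        or die "No $form form recorded for '$word'";
    return "$num $inflected";
}

print "I ", quant_ru($_, 'directory', 'accusative'), " scanned.\n"
    for 1, 3, 5, 11, 21;
```

          The table is what grows cumbersome: each new noun needs its own full
          set of entries.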
615
616       Mercifully, this design decision becomes crucial only in the hairiest
617       of inflected languages, of which Russian is by no means the worst case
618       scenario, but is worse than most.  Most languages have simpler
619       inflection systems; for example, in English or Swahili, there are
620       generally no more than two possible inflected forms for a given noun
621       ("error/errors"; "kosa/makosa"), and the rules for producing these
622       forms are fairly simple -- or at least, simple rules can be formulated
623       that work for most words, and you can then treat the exceptions as just
624       "irregular", at least relative to your ad hoc rules.  A simpler
625       inflection system (simpler rules, fewer forms) means that design
626       decisions are less crucial to maintaining sanity, whereas the same
627       decisions could incur overhead-versus-scalability problems in languages
628       like Russian.  It is also likely that code (possibly in Perl, as
629       with Lingua::EN::Inflect, for English nouns) has already been written
630       for the language in question, whether simple or complex.
631
632       Moreover, a third possibility may even be simpler than anything
633       discussed above: just require that all possible (or at least
634       applicable) forms be provided in the call to the given language's quant
635       method, as in:
636
637         "I found [quant,_1,file,files]."
638
639       That way, quant just has to choose which form it needs, without having
640       to look up or generate anything.  While possibly not optimal for
641       Russian, this should work well for most other languages, where
642       quantification is not as complicated an operation.
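
          The stock quant method in Locale::Maketext accepts its forms this
          way.  A small sketch, assuming that module and the hypothetical
          class names FileFinder::L10N and FileFinder::L10N::en:

```perl
use strict;
use warnings;

package FileFinder::L10N;           # hypothetical project base class
use parent 'Locale::Maketext';

package FileFinder::L10N::en;       # hypothetical English lexicon
use parent -norequire, 'FileFinder::L10N';
our %Lexicon = (_AUTO => 1);        # unlisted keys act as their own shorthand

package main;

my $lh = FileFinder::L10N->get_handle('en')
    or die "No language handle available";

# quant only has to pick between the two forms it was handed.
for my $count (1, 2, 17) {
    print $lh->maketext("I found [quant,_1,file,files].", $count), "\n";
}
```

          Because both forms travel with the phrase, nothing has to be looked
          up or generated, and irregular plurals need no special handling.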
643
644   The Devil in the Details
645       There's plenty more to Maketext than described above -- for example,
646       there are the details of how language tags ("en-US", "i-pwn", "fi", etc.)
647       or locale IDs ("en_US") interact with actual module naming
648       ("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there are the
649       details of how to record (and possibly negotiate) what character
650       encoding Maketext will return text in (UTF8? Latin-1? KOI8?).  There's
651       the interesting fact that Maketext is for localization, but nowhere
652       actually has a "use locale;" anywhere in it.  For the curious,
653       there are the somewhat frightening details of how I actually implement
654       something like data inheritance so that searches across modules'
655       %Lexicon hashes can parallel how Perl implements method inheritance.
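
          As a rough illustration of the first of those details, the sketch
          below asks Locale::Maketext for handles by language tag and by
          locale ID and reports which class answers.  The BogoQuery::Locale
          classes are hypothetical and are defined inline rather than in
          separate .pm files:

```perl
use strict;
use warnings;

package BogoQuery::Locale;          # hypothetical project base class
use parent 'Locale::Maketext';

package BogoQuery::Locale::en;      # general English
use parent -norequire, 'BogoQuery::Locale';
our %Lexicon = (_AUTO => 1);

package BogoQuery::Locale::en_us;   # US English, inherits from general English
use parent -norequire, 'BogoQuery::Locale::en';
our %Lexicon = (_AUTO => 1);

package main;

# Report the class of the handle that get_handle picks for a given tag.
sub class_for {
    my ($tag) = @_;
    my $handle = BogoQuery::Locale->get_handle($tag);
    return defined $handle ? ref $handle : '(none)';
}

printf "%-6s => %s\n", $_, class_for($_) for 'en-US', 'en_US', 'en-GB';
```

          The tag is lowercased and its hyphen becomes an underscore to form
          the class name; a tag with no class of its own, such as "en-GB"
          here, falls back to the more general "en" class.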
656
657       And, most importantly, there are all the practical details of how to
658       actually go about deriving from Maketext so you can use it for your
659       interfaces, and the various tools and conventions for starting out and
660       maintaining individual language modules.
661
662       That is all covered in the documentation for Locale::Maketext and the
663       modules that come with it, available on CPAN.  After having read this
664       article, which covers the why's of Maketext, the documentation, which
665       covers the how's of it, should be quite straightforward.
666
667   The Proof in the Pudding: Localizing Web Sites
668       Maketext and gettext have a notable difference: gettext is in C,
669       accessible through C library calls, whereas Maketext is in Perl, and
670       really can't work without a Perl interpreter (although I suppose
671       something like it could be written for C).  Accidents of history (and
672       not necessarily lucky ones) have made C++ the most common language for
673       the implementation of applications like word processors, Web browsers,
674       and even many in-house applications like custom query systems.  Current
675       conditions make it somewhat unlikely that the next one of any of these
676       kinds of applications will be written in Perl, albeit clearly more for
677       reasons of custom and inertia than out of consideration of what is the
678       right tool for the job.
679
680       However, other accidents of history have made Perl a well-accepted
681       language for design of server-side programs (generally in CGI form) for
682       Web site interfaces.  Localization of static pages in Web sites is
683       trivial, feasible either with simple language-negotiation features in
684       servers like Apache, or with some kind of server-side inclusions of
685       language-appropriate text into layout templates.  However, I think that
686       the localization of Perl-based search systems (or other kinds of
687       dynamic content) in Web sites, be they public or access-restricted, is
688       where Maketext will see the greatest use.
689
690       I presume that it would be only the exceptional Web site that gets
691       localized for English and Chinese and Italian and Arabic and Russian,
692       to recall the languages from the beginning of this article -- to say
693       nothing of German, Spanish, French, Japanese, Finnish, and Hindi, to
694       name a few languages that benefit from large numbers of programmers or
695       Web viewers or both.
696
697       However, the ever-increasing internationalization of the Web (whether
698       measured in terms of amount of content, of numbers of content writers
699       or programmers, or of size of content audiences) makes it increasingly
700       likely that the interface to the average Web-based dynamic content
701       service will be localized for two or maybe three languages.  It is my
702       hope that Maketext will make that task as simple as possible, and will
703       remove previous barriers to localization for languages dissimilar to
704       English.
705
706        __END__
707
708       Sean M. Burke (sburke@cpan.org) has a Master's in linguistics from
709       Northwestern University; he specializes in language technology.  Jordan
710       Lachler (lachler@unm.edu) is a PhD student in the Department of
711       Linguistics at the University of New Mexico; he specializes in
712       morphology and pedagogy of North American native languages.
713
714   References
715       Alvestrand, Harald Tveit.  1995.  RFC 1766: Tags for the Identification
716       of Languages.  "ftp://ftp.isi.edu/in-notes/rfc1766.txt" [Now see RFC
717       3066.]
718
719       Callon, Ross, editor.  1996.  RFC 1925: The Twelve Networking Truths.
720       "ftp://ftp.isi.edu/in-notes/rfc1925.txt"
721
722       Drepper, Ulrich, Peter Miller, and Francois Pinard.  1995-2001.  GNU
723       "gettext".  Available in "ftp://prep.ai.mit.edu/pub/gnu/", with
724       extensive docs in the distribution tarball.  [Since I wrote this
725       article in 1998, I now see that the gettext docs are now trying more to
726       come to terms with plurality.  Whether useful conclusions have come
727       from it is another question altogether. -- SMB, May 2001]
728
729       Forbes, Nevill.  1964.  Russian Grammar.  Third Edition, revised by J.
730       C. Dumbreck.  Oxford University Press.
731
732
733
734perl v5.12.4                      2011-06-07      Locale::Maketext::TPJ13(3pm)