Locale::Maketext::TPJ13(3pm)  Perl Programmers Reference Guide  Locale::Maketext::TPJ13(3pm)

NAME

       Locale::Maketext::TPJ13 -- article about software localization

SYNOPSIS

         # This is an article, not a module.

DESCRIPTION

       The following article by Sean M. Burke and Jordan Lachler first
       appeared in The Perl Journal #13 and is copyright 1999 The Perl
       Journal. It appears courtesy of Jon Orwant and The Perl Journal.  This
       document may be distributed under the same terms as Perl itself.

Localization and Perl: gettext breaks, Maketext fixes

18       by Sean M. Burke and Jordan Lachler
19
20       This article points out cases where gettext (a common system for local‐
21       izing software interfaces -- i.e., making them work in the user's lan‐
22       guage of choice) fails because of basic differences between human lan‐
23       guages.  This article then describes Maketext, a new system capable of
24       correctly treating these differences.
25
26       A Localization Horror Story: It Could Happen To You
27
28           "There are a number of languages spoken by human beings in this
29           world."
30
31           -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the Identifica‐
32           tion of Languages"
33
34       Imagine that your task for the day is to localize a piece of software
35       -- and luckily for you, the only output the program emits is two mes‐
36       sages, like this:
37
         I scanned 12 directories.

         Your query matched 10 files in 4 directories.
41
42       So how hard could that be?  You look at the code that produces the
43       first item, and it reads:
44
         printf("I scanned %g directories.",
                $directory_count);
47
48       You think about that, and realize that it doesn't even work right for
49       English, as it can produce this output:
50
         I scanned 1 directories.
52
53       So you rewrite it to read:
54
         printf("I scanned %g %s.",
                $directory_count,
                $directory_count == 1 ?
                  "directory" : "directories",
         );
60
61       ...which does the Right Thing.  (In case you don't recall, "%g" is for
62       locale-specific number interpolation, and "%s" is for string interpola‐
63       tion.)
64
65       But you still have to localize it for all the languages you're produc‐
66       ing this software for, so you pull Locale::gettext off of CPAN so you
67       can access the "gettext" C functions you've heard are standard for
68       localization tasks.
69
70       And you write:
71
         printf(gettext("I scanned %g %s."),
                $dir_scan_count,
                $dir_scan_count == 1 ?
                  gettext("directory") : gettext("directories"),
         );
77
       But you then read in the gettext manual (Drepper, Miller, and Pinard
       1995) that this is not a good idea, since how a single word like
       "directory" or "directories" is translated may depend on context -- and
       this is true, since in a case language like German or Russian, you
       may need these words with a different case ending in the first instance
       (where the word is the object of a verb) than in the second instance,
       which you haven't even gotten to yet (where the word is the object of a
       preposition, "in %g directories") -- assuming these keep the same
       syntax when translated into those languages.
87
88       So, on the advice of the gettext manual, you rewrite:
89
         printf( $dir_scan_count == 1 ?
                  gettext("I scanned %g directory.") :
                  gettext("I scanned %g directories."),
                $dir_scan_count );
94
       So, you email your various translators (the boss decides that the
       languages du jour are Chinese, Arabic, Russian, and Italian, so you have
       one translator for each), asking for translations for "I scanned %g
       directory." and "I scanned %g directories.".  When they reply, you'll
       put their answers in the lexicons for gettext to use when it localizes
       your software, so that when the user is running under the "zh" (Chinese)
       locale, gettext("I scanned %g directory.") will return the appropriate
       Chinese text, with a "%g" in there where printf can then interpolate
       $dir_scan_count.
104
105       Your Chinese translator emails right back -- he says both of these
106       phrases translate to the same thing in Chinese, because, in linguistic
107       jargon, Chinese "doesn't have number as a grammatical category" --
108       whereas English does.  That is, English has grammatical rules that
109       refer to "number", i.e., whether something is grammatically singular or
110       plural; and one of these rules is the one that forces nouns to take a
111       plural suffix (generally "s") when in a plural context, as they are
112       when they follow a number other than "one" (including, oddly enough,
113       "zero").  Chinese has no such rules, and so has just the one phrase
114       where English has two.  But, no problem, you can have this one Chinese
115       phrase appear as the translation for the two English phrases in the
116       "zh" gettext lexicon for your program.
117
       Emboldened by this, you dive into the second phrase that your software
       needs to output: "Your query matched 10 files in 4 directories.".  You
       notice that if you want to treat phrases as indivisible, as the gettext
       manual wisely advises, you need four cases now, instead of two, to
       cover the permutations of singular and plural on the two items,
       $directory_count and $file_count.  So you try this:
124
         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            gettext("Your query matched %g file in %g directory.") :
            gettext("Your query matched %g file in %g directories.") ) :
           ( $directory_count == 1 ?
            gettext("Your query matched %g files in %g directory.") :
            gettext("Your query matched %g files in %g directories.") ),
          $file_count, $directory_count,
         );
134
135       (The case of "1 file in 2 [or more] directories" could, I suppose,
136       occur in the case of symlinking or something of the sort.)
137
138       It occurs to you that this is not the prettiest code you've ever writ‐
139       ten, but this seems the way to go.  You mail off to the translators
140       asking for translations for these four cases.  The Chinese guy replies
141       with the one phrase that these all translate to in Chinese, and that
142       phrase has two "%g"s in it, as it should -- but there's a problem.  He
143       translates it word-for-word back: "In %g directories contains %g files
144       match your query."  The %g slots are in an order reverse to what they
145       are in English.  You wonder how you'll get gettext to handle that.
146
147       But you put it aside for the moment, and optimistically hope that the
148       other translators won't have this problem, and that their languages
149       will be better behaved -- i.e., that they will be just like English.
150
       But the Arabic translator is the next to write back.  First off, your
       code for "I scanned %g directory." or "I scanned %g directories."
       assumes there's only singular or plural.  But, to use linguistic jargon
       again, Arabic has grammatical number, like English (but unlike
       Chinese), but it's a three-term category: singular, dual, and plural.
       In other words, the way you say "directory" depends on whether there's
       one directory, or two of them, or more than two of them.  Your test of
       "($directory_count == 1)" no longer does the job.  And it means that
       where English's grammatical category of number necessitates only the
       two permutations of the first sentence based on "directory [singular]"
       and "directories [plural]", Arabic has three -- and, worse, in the
       second sentence ("Your query matched %g file in %g directory."), where
       English has four, Arabic has nine.  You sense an unwelcome, exponential
       trend taking shape.
165
       Your Italian translator emails you back and says that "I scanned 0
       directories" (a possible English output of your program) is stilted,
       and if you think that's fine English, that's your problem, but that
       just will not do in the language of Dante.  He insists that where
       $directory_count is 0, your program should produce the Italian text for
       "I didn't scan any directories.".  And ditto for "I didn't match any
       files in any directories", although he says the last part about "in any
       directories" should probably just be left off.
174
       You wonder how you'll get gettext to handle this; to accommodate the
       ways Arabic, Chinese, and Italian deal with numbers in just these few
       very simple phrases, you need to write code that will ask gettext for
       different queries depending on whether the numerical values in question
       are 1, 2, more than 2, or in some cases 0, and you still haven't
       figured out the problem with the different word order in Chinese.
181
182       Then your Russian translator calls on the phone, to personally tell you
183       the bad news about how really unpleasant your life is about to become:
184
       Russian, like German or Latin, is an inflectional language; that is,
       nouns and adjectives have to take endings that depend on their case
       (i.e., nominative, accusative, genitive, etc.) -- which is roughly a
       matter of what role they have in the syntax of the sentence -- as well
       as on the grammatical gender (i.e., masculine, feminine, neuter) and
       number (i.e., singular or plural) of the noun, as well as on the
       declension class of the noun.  But unlike with most other inflected
       languages, putting a number-phrase (like "ten" or "forty-three", or
       their Arabic numeral equivalents) in front of a noun in Russian can
       change the case and number that noun is, and therefore the endings you
       have to put on it.
196
       He elaborates:  In "I scanned %g directories", you'd expect
       "directories" to be in the accusative case (since it is the direct
       object in the sentence) and the plural number, except where
       $directory_count is 1, where you'd expect the singular, of course.
       Just like Latin or German.  But!  Where $directory_count % 10 is 1 ("%"
       for modulo, remember), assuming $directory_count is an integer, and
       except where $directory_count % 100 is 11, "directories" is forced to
       become grammatically singular, which means it gets the ending for the
       accusative singular...  You begin to visualize the code it'd take to
       test for the problem so far, and still work for Chinese and Arabic and
       Italian, and how many gettext items that'd take, but he keeps going...
       But where $directory_count % 10 is 2, 3, or 4 (except where
       $directory_count % 100 is 12, 13, or 14), the word for "directories" is
       forced to be genitive singular -- which means another ending... The
       room begins to spin around you, slowly at first...  But with all other
       integer values, since "directory" is an inanimate noun, when preceded
       by a number and in the nominative or accusative cases (as it is here,
       just your luck!), it does stay plural, but it is forced into the
       genitive case -- yet another ending...  And you never hear him get to
       the part about how you're going to run into similar (but maybe subtly
       different) problems with other Slavic languages like Polish, because
       the floor comes up to meet you, and you fade into unconsciousness.
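
       The three Russian patterns the translator describes can be sketched as
       a single classification function.  This is only an illustration of the
       rules as stated in the story: the function name and the class labels
       are hypothetical, and it assumes a non-negative integer count.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the noun form a Russian count calls for,
# following the three rules given in the paragraph above.
sub russian_number_class {
    my ($n) = @_;
    my $mod10  = $n % 10;
    my $mod100 = $n % 100;

    # 1, 21, 31, ... but not 11: grammatically singular
    return 'accusative_singular'
        if $mod10 == 1 && $mod100 != 11;

    # 2-4, 22-24, ... but not 12-14: genitive singular
    return 'genitive_singular'
        if $mod10 >= 2 && $mod10 <= 4
        && !($mod100 >= 12 && $mod100 <= 14);

    # everything else (0, 5-20, 25-30, ...): genitive plural
    return 'genitive_plural';
}

printf "%3d => %s\n", $_, russian_number_class($_)
    for 1, 2, 5, 11, 12, 21, 22, 25, 101, 111;
```

       Even this small function has no counterpart in a lexicon that can hold
       only strings, which is the difficulty the rest of the article takes up.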
219
220       The above cautionary tale relates how an attempt at localization can
221       lead from programmer consternation, to program obfuscation, to a need
222       for sedation.  But careful evaluation shows that your choice of tools
223       merely needed further consideration.
224
225       The Linguistic View
226
227           "It is more complicated than you think."
228
229           -- The Eighth Networking Truth, from RFC 1925
230
       The field of Linguistics has expended a great deal of effort over the
       past century trying to find grammatical patterns which hold across
       languages; it's been a constant process of people making
       generalizations that should apply to all languages, only to find out
       that, all too often, these generalizations fail -- sometimes failing
       for just a few languages, sometimes whole classes of languages, and
       sometimes nearly every language in the world except English.  Broad
       statistical trends are evident in what the "average language" is like
       as far as what its rules can look like, must look like, and cannot look
       like.  But the "average language" is just as unreal a concept as the
       "average person" -- it runs up against the fact that no language (or
       person) is, in fact, average.  The wisdom of past experience leads us
       to believe that any given language can do whatever it wants, in any
       order, with appeal to any kind of grammatical categories it wants --
       case, number, tense, real or metaphoric characteristics of the things
       that words refer to, arbitrary or predictable classifications of words
       based on what endings or prefixes they can take, degree or means of
       certainty about the truth of statements expressed, and so on, ad
       infinitum.
249
       Mercifully, most localization tasks are a matter of finding ways to
       translate whole phrases, generally sentences, where the context is
       relatively set, and where the only variation in content is usually in a
       number being expressed -- as in the example sentences above.
       Translating specific, fully-formed sentences is, in practice, fairly
       foolproof -- which is good, because that's what's in the phrasebooks
       that so many tourists rely on.  Now, a given phrase (whether in a
       phrasebook or in a gettext lexicon) in one language might have a
       greater or lesser applicability than that phrase's translation into
       another language -- for example, strictly speaking, in Arabic, the
       "your" in "Your query matched..." would take a different form depending
       on whether the user is male or female; so the Arabic translation
       "your[feminine] query" is applicable in fewer cases than the
       corresponding English phrase, which doesn't distinguish the user's
       gender.  (In practice, it's not feasible to have a program know the
       user's gender, so the masculine "you" in Arabic is usually used, by
       default.)
266
267       But in general, such surprises are rare when entire sentences are being
268       translated, especially when the functional context is restricted to
269       that of a computer interacting with a user either to convey a fact or
270       to prompt for a piece of information.  So, for purposes of localiza‐
271       tion, translation by phrase (generally by sentence) is both the sim‐
272       plest and the least problematic.
273
274       Breaking gettext
275
276           "It Has To Work."
277
278           -- First Networking Truth, RFC 1925
279
280       Consider that sentences in a tourist phrasebook are of two types: ones
281       like "How do I get to the marketplace?" that don't have any blanks to
282       fill in, and ones like "How much do these ___ cost?", where there's one
283       or more blanks to fill in (and these are usually linked to a list of
284       words that you can put in that blank: "fish", "potatoes", "tomatoes",
285       etc.)  The ones with no blanks are no problem, but the fill-in-the-
286       blank ones may not be really straightforward. If it's a Swahili phrase‐
287       book, for example, the authors probably didn't bother to tell you the
288       complicated ways that the verb "cost" changes its inflectional prefix
289       depending on the noun you're putting in the blank.  The trader in the
290       marketplace will still understand what you're saying if you say "how
291       much do these potatoes cost?" with the wrong inflectional prefix on
292       "cost".  After all, you can't speak proper Swahili, you're just a
293       tourist.  But while tourists can be stupid, computers are supposed to
294       be smart; the computer should be able to fill in the blank, and still
295       have the results be grammatical.
296
297       In other words, a phrasebook entry takes some values as parameters (the
298       things that you fill in the blank or blanks), and provides a value
299       based on these parameters, where the way you get that final value from
300       the given values can, properly speaking, involve an arbitrarily complex
301       series of operations.  (In the case of Chinese, it'd be not at all com‐
302       plex, at least in cases like the examples at the beginning of this
303       article; whereas in the case of Russian it'd be a rather complex series
304       of operations.  And in some languages, the complexity could be spread
305       around differently: while the act of putting a number-expression in
306       front of a noun phrase might not be complex by itself, it may change
307       how you have to, for example, inflect a verb elsewhere in the sentence.
308       This is what in syntax is called "long-distance dependencies".)
309
310       This talk of parameters and arbitrary complexity is just another way to
311       say that an entry in a phrasebook is what in a programming language
312       would be called a "function".  Just so you don't miss it, this is the
313       crux of this article: A phrase is a function; a phrasebook is a bunch
314       of functions.
315
       The reason that using gettext runs into walls (as in the above
       second-person horror story) is that you're trying to use a string (or
       worse, a choice among a bunch of strings) to do what you really need a
       function for -- which is futile.  Performing (s)printf interpolation on
       the strings which you get back from gettext does allow you to do some
       common things passably well... sometimes... sort of; but, to paraphrase
       what some people say about "csh" script programming, "it fools you into
       thinking you can use it for real things, but you can't, and you don't
       discover this until you've already spent too much time trying, and by
       then it's too late."
326
327       Replacing gettext
328
329       So, what needs to replace gettext is a system that supports lexicons of
330       functions instead of lexicons of strings.  An entry in a lexicon from
331       such a system should not look like this:
332
         "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"
334
335       [\xE9 is e-acute in Latin-1.  Some pod renderers would scream if I used
336       the actual character here. -- SB]
337
338       but instead like this, bearing in mind that this is just a first stab:
339
         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }
348
349       Now, there's no particularly obvious way to store anything but strings
350       in a gettext lexicon; so it looks like we just have to start over and
351       make something better, from scratch.  I call my shot at a gettext-
352       replacement system "Maketext", or, in CPAN terms, Locale::Maketext.
353
354       When designing Maketext, I chose to plan its main features in terms of
355       "buzzword compliance".  And here are the buzzwords:
356
357       Buzzwords: Abstraction and Encapsulation
358
359       The complexity of the language you're trying to output a phrase in is
360       entirely abstracted inside (and encapsulated within) the Maketext mod‐
361       ule for that interface.  When you call:
362
         print $lang->maketext("You have [quant,_1,piece] of new mail.",
                              scalar(@messages));
365
366       you don't know (and in fact can't easily find out) whether this will
367       involve lots of figuring, as in Russian (if $lang is a handle to the
368       Russian module), or relatively little, as in Chinese.  That kind of
369       abstraction and encapsulation may encourage other pleasant buzzwords
370       like modularization and stratification, depending on what design deci‐
371       sions you make.
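
       For a sense of how such a call fits into a whole program, here is a
       minimal sketch using the Locale::Maketext module as distributed with
       Perl.  The package names MyApp::L10N and MyApp::L10N::en are
       assumptions made up for the example, and the "_AUTO" lexicon entry is
       used only so that the English phrase can serve as its own entry.

```perl
use strict;
use warnings;

# Base class for this (hypothetical) application's language modules.
package MyApp::L10N;
use parent 'Locale::Maketext';

# English language module.  With _AUTO set, a key that is not listed
# in the lexicon is compiled and used as its own entry.
package MyApp::L10N::en;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = ( _AUTO => 1 );

package main;

# Ask for a language session handle for English.
my $lang = MyApp::L10N->get_handle('en')
    or die "No language handle available";

my @messages = ('first', 'second', 'third');
my $text = $lang->maketext("You have [quant,_1,piece] of new mail.",
                           scalar(@messages));
print "$text\n";
```

       The calling code never tests whether the count is 1; that decision
       lives in the "quant" method of whichever language class the handle
       belongs to.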
372
373       Buzzword: Isomorphism
374
375       "Isomorphism" means "having the same structure or form"; in discussions
376       of program design, the word takes on the special, specific meaning that
377       your implementation of a solution to a problem has the same structure
378       as, say, an informal verbal description of the solution, or maybe of
379       the problem itself.  Isomorphism is, all things considered, a good
380       thing -- it's what problem-solving (and solution-implementing) should
381       look like.
382
       What's wrong with gettext-using code like this...
384
         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            "Your query matched %g file in %g directory." :
            "Your query matched %g file in %g directories." ) :
           ( $directory_count == 1 ?
            "Your query matched %g files in %g directory." :
            "Your query matched %g files in %g directories." ),
          $file_count, $directory_count,
         );
394
395       is first off that it's not well abstracted -- these ways of testing for
396       grammatical number (as in the expressions like "foo == 1 ?  singu‐
397       lar_form : plural_form") should be abstracted to each language module,
398       since how you get grammatical number is language-specific.
399
400       But second off, it's not isomorphic -- the "solution" (i.e., the
401       phrasebook entries) for Chinese maps from these four English phrases to
402       the one Chinese phrase that fits for all of them.  In other words, the
403       informal solution would be "The way to say what you want in Chinese is
404       with the one phrase 'For your question, in Y directories you would find
405       X files'" -- and so the implemented solution should be, isomorphically,
406       just a straightforward way to spit out that one phrase, with numerals
407       properly interpolated.  It shouldn't have to map from the complexity of
408       other languages to the simplicity of this one.
409
410       Buzzword: Inheritance
411
412       There's a great deal of reuse possible for sharing of phrases between
413       modules for related dialects, or for sharing of auxiliary functions
414       between related languages.  (By "auxiliary functions", I mean functions
415       that don't produce phrase-text, but which, say, return an answer to
416       "does this number require a plural noun after it?".  Such auxiliary
417       functions would be used in the internal logic of functions that actu‐
418       ally do produce phrase-text.)
419
       In the case of sharing phrases, consider that you have an interface
       already localized for American English (probably by having been written
       with that as the native locale, but that's incidental).  Localizing it
       for UK English should, in practical terms, be just a matter of running
       it past a British person with the instructions to indicate what few
       phrases would benefit from a change in spelling or possibly minor
       rewording.  In that case, you should be able to put in the UK English
       localization module only those phrases that are UK-specific, and for
       all the rest, inherit from the American English module.  (And I expect
       this same situation would apply with Brazilian and Continental
       Portuguese, possibly with some very closely related languages like
       Czech and Slovak, and possibly with the slightly different "versions"
       of written Mandarin Chinese, as I hear exist in Taiwan and mainland
       China.)
434
       As to sharing of auxiliary functions, consider the problem of Russian
       numbers from the beginning of this article; obviously, you'd want to
       write only once the hairy code that, given a numeric value, would
       return some specification of which case and number a given quantified
       noun should use.  But suppose that you discover, while localizing an
       interface for, say, Ukrainian (a Slavic language related to Russian,
       spoken by tens of millions of people, many of whom would be relieved to
       find that your Web site's or software's interface is available in their
       language), that the rules in Ukrainian are the same as in Russian for
       quantification, and probably for many other grammatical functions.
       While there may well be no phrases in common between Russian and
       Ukrainian, you could still choose to have the Ukrainian module inherit
       from the Russian module, just for the sake of inheriting all the
       various grammatical methods.  Or, probably better organizationally, you
       could move those functions to a module called "_E_Slavic" or something,
       which Russian and Ukrainian could inherit useful functions from, but
       which would (presumably) provide no lexicon.
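
       As a rough sketch of that arrangement, the shared grammatical method
       can live in a lexicon-free base class that both language classes
       inherit from.  The package names, the method name "number_class", and
       the placeholder lexicon entries below are all hypothetical, and the
       sketch takes the article's supposition that Ukrainian follows the
       Russian rule at face value.

```perl
use strict;
use warnings;

# Shared grammar helpers only; deliberately no %Lexicon here.
package My::L10N::_E_Slavic;

sub new { my ($class) = @_; return bless {}, $class }

# Which noun form a count calls for, per the rule described earlier.
sub number_class {
    my ($self, $n) = @_;
    my ($mod10, $mod100) = ($n % 10, $n % 100);
    return 'singular'
        if $mod10 == 1 && $mod100 != 11;
    return 'genitive_singular'
        if $mod10 >= 2 && $mod10 <= 4 && ($mod100 < 12 || $mod100 > 14);
    return 'genitive_plural';
}

# Each language supplies its own phrases but inherits the grammar.
package My::L10N::ru;
our @ISA = ('My::L10N::_E_Slavic');
our %Lexicon = ( 'Hello' => '(Russian text for "Hello" goes here)' );

package My::L10N::uk;
our @ISA = ('My::L10N::_E_Slavic');
our %Lexicon = ( 'Hello' => '(Ukrainian text for "Hello" goes here)' );

package main;

for my $class ('My::L10N::ru', 'My::L10N::uk') {
    my $handle = $class->new;
    print "$class: 22 => ", $handle->number_class(22), "\n";
}
```

       If the two languages later turn out to differ on some point, the
       language class concerned can override just that one method.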
452
453       Buzzword: Concision
454
455       Okay, concision isn't a buzzword.  But it should be, so I decree that
456       as a new buzzword, "concision" means that simple common things should
457       be expressible in very few lines (or maybe even just a few characters)
458       of code -- call it a special case of "making simple things easy and
459       hard things possible", and see also the role it played in the
460       MIDI::Simple language, discussed elsewhere in this issue [TPJ#13].
461
462       Consider our first stab at an entry in our "phrasebook of functions":
463
         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }
472
473       You may sense that a lexicon (to use a non-committal catch-all term for
474       a collection of things you know how to say, regardless of whether
475       they're phrases or words) consisting of functions expressed as above
476       would make for rather long-winded and repetitive code -- even if you
477       wisely rewrote this to have quantification (as we call adding a number
478       expression to a noun phrase) be a function called like:
479
         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = quant($files, "fichier");
           $dirs =  quant($dirs,  "r\xE9pertoire");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }
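
       The article does not show the body of "quant" at this point.  A
       minimal sketch of what such a helper might look like follows, assuming
       the same naive rule as the longer version above (singular only when
       the count is exactly 1) and assuming that a plural is formed by adding
       "s" unless one is supplied; the optional third argument is a
       hypothetical addition.

```perl
use strict;
use warnings;

# Hypothetical quant(): join a count to the right form of a noun.
sub quant {
    my ($count, $singular, $plural) = @_;
    # Assumed default: the plural is the singular plus "s".
    $plural = $singular . 's' unless defined $plural;
    return sprintf("%g %s", $count, $count == 1 ? $singular : $plural);
}

sub I_found_X1_files_in_X2_directories {
    my ($files, $dirs) = @_[0,1];
    $files = quant($files, "fichier");
    $dirs  = quant($dirs,  "r\xE9pertoire");
    return "J'ai trouv\xE9 $files dans $dirs.";
}

print I_found_X1_files_in_X2_directories(1, 4), "\n";
```

       Nouns with irregular plurals would pass the plural explicitly, as in
       quant($n, "cheval", "chevaux").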
486
487       And you may also sense that you do not want to bother your translators
488       with having to write Perl code -- you'd much rather that they spend
489       their very costly time on just translation.  And this is to say nothing
490       of the near impossibility of finding a commercial translator who would
491       know even simple Perl.
492
493       In a first-hack implementation of Maketext, each language-module's lex‐
494       icon looked like this:
495
         %Lexicon = (
           "I found %g files in %g directories"
           => sub {
              my( $files, $dirs ) = @_[0,1];
              $files = quant($files, "fichier");
              $dirs =  quant($dirs,  "r\xE9pertoire");
              return "J'ai trouv\xE9 $files dans $dirs.";
            },
           # ... and so on with other phrase => sub mappings ...
         );
506
507       but I immediately went looking for some more concise way to basically
508       denote the same phrase-function -- a way that would also serve to con‐
509       cisely denote most phrase-functions in the lexicon for most languages.
510       After much time and even some actual thought, I decided on this system:
511
512       * Where a value in a %Lexicon hash is a contentful string instead of an
513       anonymous sub (or, conceivably, a coderef), it would be interpreted as
514       a sort of shorthand expression of what the sub does.  When accessed for
515       the first time in a session, it is parsed, turned into Perl code, and
516       then eval'd into an anonymous sub; then that sub replaces the original
517       string in that lexicon.  (That way, the work of parsing and evaling the
518       shorthand form for a given phrase is done no more than once per ses‐
519       sion.)
520
521       * Calls to "maketext" (as Maketext's main function is called) happen
522       thru a "language session handle", notionally very much like an IO han‐
523       dle, in that you open one at the start of the session, and use it for
524       "sending signals" to an object in order to have it return the text you
525       want.
526
527       So, this:
528
         $lang->maketext("You have [quant,_1,piece] of new mail.",
                        scalar(@messages));
531
532       basically means this: look in the lexicon for $lang (which may inherit
533       from any number of other lexicons), and find the function that we hap‐
534       pen to associate with the string "You have [quant,_1,piece] of new
535       mail" (which is, and should be, a functioning "shorthand" for this
536       function in the native locale -- English in this case).  If you find
537       such a function, call it with $lang as its first parameter (as if it
538       were a method), and then a copy of scalar(@messages) as its second, and
539       then return that value.  If that function was found, but was in string
540       shorthand instead of being a fully specified function, parse it and
541       make it into a function before calling it the first time.
542
543       * The shorthand uses code in brackets to indicate method calls that
544       should be performed.  A full explanation is not in order here, but a
545       few examples will suffice:
546
         "You have [quant,_1,piece] of new mail."
548
549       The above code is shorthand for, and will be interpreted as, this:
550
         sub {
           my $handle = $_[0];
           my(@params) = @_;   # $params[0] is the handle itself
           return join '',
             "You have ",
             $handle->quant($params[1], 'piece'),
             " of new mail.";
         }
559
       where "quant" is the name of a method you're using to quantify the noun
       "piece" with the number $params[1] (the first argument after the
       handle, which is $params[0]).
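
       To make the compile-once idea concrete, here is a much simplified
       sketch.  It is not how Locale::Maketext itself is implemented: instead
       of generating Perl source and eval'ing it, it builds a closure
       directly, and it handles only plain "[method,_N,word]" calls.  The
       names compile_shorthand, lookup, and Demo::Handle are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for a language handle, with a naive English quant method.
package Demo::Handle;
sub new { my ($class) = @_; return bless {}, $class }
sub quant {
    my ($self, $n, $noun) = @_;
    return "$n $noun" . ($n == 1 ? '' : 's');
}

package main;

# Turn a shorthand string into a sub taking ($handle, @params).
sub compile_shorthand {
    my ($string) = @_;
    my @parts;    # literal strings and code refs, in order
    # The capturing group makes split keep the bracketed calls.
    for my $chunk (split /(\[[^\]]*\])/, $string) {
        next if $chunk eq '';
        if ($chunk =~ /^\[(.*)\]$/) {
            my ($method, @args) = split /,/, $1;
            push @parts, sub {
                my ($handle, @params) = @_;
                # _1, _2, ... stand for the caller's parameters.
                my @resolved =
                    map { $_ =~ /^_(\d+)$/ ? $params[$1 - 1] : $_ } @args;
                return $handle->$method(@resolved);
            };
        }
        else {
            push @parts, $chunk;
        }
    }
    return sub {
        my ($handle, @params) = @_;
        return join '',
            map { ref($_) ? $_->($handle, @params) : $_ } @parts;
    };
}

our %Lexicon = (
    "You have [quant,_1,piece] of new mail."
        => "You have [quant,_1,piece] of new mail.",
);

# Compile a string entry on first use, then keep the sub in its place.
sub lookup {
    my ($key) = @_;
    my $entry = $Lexicon{$key};
    $entry = $Lexicon{$key} = compile_shorthand($entry) unless ref $entry;
    return $entry;
}

my $handle = Demo::Handle->new;
my $phrase = lookup("You have [quant,_1,piece] of new mail.");
print $phrase->($handle, 3), "\n";
```

       After the first lookup the lexicon holds a code ref rather than a
       string, so later calls skip the parsing step entirely.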
562
563       A string with no brackety calls, like this:
564
565         "Your search expression was malformed."
566
       is somewhat of a degenerate case, and just gets turned into:
568
569         sub { return "Your search expression was malformed." }
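
       The handle-based call and both kinds of shorthand described above can
       be tried directly against the Locale::Maketext module.  What follows is
       only a minimal sketch: the class names MyApp::L10N and MyApp::L10N::en
       are made up for illustration, and the lexicon relies on the _AUTO
       setting so that an English key missing from the lexicon is compiled
       from its own text.

```perl
use strict;
use warnings;

# Hypothetical project base class; every language class derives from it.
package MyApp::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

# Hypothetical English lexicon.  _AUTO => 1 means a key that is not
# listed is compiled from its own text, so English keys work as-is.
package MyApp::L10N::en;
our @ISA = ('MyApp::L10N');
our %Lexicon = ( _AUTO => 1 );

package main;

# Open the language session handle once, at the start of the session.
my $lang = MyApp::L10N->get_handle('en')
    or die "Could not get a language handle";

my @messages = ('first', 'second', 'third');

# Bracket shorthand: compiled into a sub on first use, then called.
my $mail = $lang->maketext("You have [quant,_1,piece] of new mail.",
                           scalar(@messages));

# No brackets: the degenerate case, returned unchanged.
my $plain = $lang->maketext("Your search expression was malformed.");

print "$mail\n$plain\n";
```

       The stock quant method supplies the number and a naive English plural,
       which is enough for "piece" but not for every noun.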
570
571       However, not everything you can write in Perl code can be written in
572       the above shorthand system -- not by a long shot.  For example, con‐
573       sider the Italian translator from the beginning of this article, who
574       wanted the Italian for "I didn't find any files" as a special case,
575       instead of "I found 0 files".  That couldn't be specified (at least not
576       easily or simply) in our shorthand system, and it would have to be
577       written out in full, like this:
578
579         sub {  # pretend the English strings are in Italian
580           my($handle, $files, $dirs) = @_[0,1,2];
581           return "I didn't find any files" unless $files;
582           return join '',
583             "I found ",
584             $handle->quant($files, 'file'),
585             " in ",
586             $handle->quant($dirs,  'directory'),
587             ".";
588         }
589
590       Next to a lexicon full of shorthand code, that sort of sticks out like
591       a sore thumb -- but this is a special case, after all; and at least
592       it's possible, if not as concise as usual.
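
       A lexicon can mix the two styles freely.  The sketch below assumes a
       hypothetical MyApp::L10N::it class and, as in the text, lets English
       strings stand in for Italian; one entry is plain shorthand and the
       other is a full sub, which Locale::Maketext calls with the handle
       followed by the parameters.

```perl
use strict;
use warnings;

package MyApp::L10N;            # hypothetical project base class
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package MyApp::L10N::it;        # hypothetical "Italian" lexicon
our @ISA = ('MyApp::L10N');
our %Lexicon = (
    # Ordinary entry: shorthand is all that is needed.
    'I scanned [quant,_1,directory,directories].'
        => 'I scanned [quant,_1,directory,directories].',

    # Special-case entry: written out in full as a sub.
    'Your query matched [quant,_1,file] in [quant,_2,directory,directories].'
        => sub {
            my ($handle, $files, $dirs) = @_;
            return "I didn't find any files." unless $files;
            return join '',
                'I found ',
                $handle->quant($files, 'file'),
                ' in ',
                $handle->quant($dirs, 'directory', 'directories'),
                '.';
        },
);

package main;

my $lang = MyApp::L10N->get_handle('it')
    or die "Could not get a language handle";

my $key =
    'Your query matched [quant,_1,file] in [quant,_2,directory,directories].';

my $none = $lang->maketext($key, 0, 0);    # takes the special-case branch
my $some = $lang->maketext($key, 10, 4);   # takes the ordinary branch

print "$none\n$some\n";
```

       The calling code is identical in both cases; only the lexicon knows
       that this one message needed more than shorthand.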
593
594       As to how you'd implement the Russian example from the beginning of the
595       article, well, There's More Than One Way To Do It, but it could be
596       something like this (using English words for Russian, just so you know
597       what's going on):
598
599         "I [quant,_1,directory,accusative] scanned."
600
601       This shifts the burden of complexity off to the quant method.  That
602       method's parameters are: the numeric value it's going to use to quan‐
603       tify something; the Russian word it's going to quantify; and the param‐
604       eter "accusative", which you're using to mean that this sentence's syn‐
605       tax wants a noun in the accusative case there, although that quantifi‐
606       cation method may have to overrule, for grammatical reasons you may
607       recall from the beginning of this article.
608
609       Now, the Russian quant method here is responsible not only for imple‐
610       menting the strange logic necessary for figuring out how Russian num‐
611       ber-phrases impose case and number on their noun-phrases, but also for
612       inflecting the Russian word for "directory".  How that inflection is to
613       be carried out is no small issue, and among the solutions I've seen,
614       some (like variations on a simple lookup in a hash where all possible
615       forms are provided for all necessary words) are straightforward but can
616       become cumbersome when you need to inflect more than a few dozen words;
617       and other solutions (like using algorithms to model the inflections,
618       storing only root forms and irregularities) can involve more overhead
619       than is justifiable for all but the largest lexicons.
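
       To make the lookup-table approach concrete, here is a rough sketch of
       a language class that overrides quant.  Everything specific in it is
       an assumption made for illustration: the class name MyApp::L10N::ru,
       the %Forms table, the English glosses standing in for real Russian
       word forms, and the simplified number rule spelled out in the comment
       above the method.

```perl
use strict;
use warnings;

package MyApp::L10N;            # hypothetical project base class
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package MyApp::L10N::ru;        # hypothetical "Russian" language class
our @ISA = ('MyApp::L10N');
our %Lexicon = (
    'I scanned [quant,_1,directory].'
        => 'I [quant,_1,directory,accusative] scanned.',
);

# Lookup table of inflected forms.  The values are English glosses
# standing in for the real Russian word forms.
my %Forms = (
    directory => {
        accusative_singular => 'directory-ACC.SG',
        genitive_singular   => 'directory-GEN.SG',
        genitive_plural     => 'directory-GEN.PL',
    },
);

# Overrides the inherited quant.  Assumed, simplified rule:
#   ends in 1 (but not 11)       -> requested case, singular
#   ends in 2-4 (but not 12-14)  -> genitive singular
#   anything else                -> genitive plural
sub quant {
    my ($handle, $num, $word, $case) = @_;
    my $forms = $Forms{$word}
        or return $handle->numf($num) . " $word";  # unknown word: uninflected

    my $last_two = abs(int $num) % 100;
    my $last_one = $last_two % 10;

    my $form = ($last_two >= 11 && $last_two <= 14) ? 'genitive_plural'
             : ($last_one == 1)                     ? "${case}_singular"
             : ($last_one >= 2 && $last_one <= 4)   ? 'genitive_singular'
             :                                        'genitive_plural';

    return $handle->numf($num) . ' ' . $forms->{$form};
}

package main;

my $lang = MyApp::L10N->get_handle('ru')
    or die "Could not get a language handle";

my %got;
for my $n (1, 3, 5, 12, 21) {
    $got{$n} = $lang->maketext('I scanned [quant,_1,directory].', $n);
    print "$n: $got{$n}\n";
}
```

       The table is easy to read but needs one entry per word, which is the
       scaling cost mentioned above.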
620
621       Mercifully, this design decision becomes crucial only in the hairiest
622       of inflected languages, of which Russian is by no means the worst case
623       scenario, but is worse than most.  Most languages have simpler inflec‐
624       tion systems; for example, in English or Swahili, there are generally
625       no more than two possible inflected forms for a given noun
626       ("error/errors"; "kosa/makosa"), and the rules for producing these
627       forms are fairly simple -- or at least, simple rules can be formulated
628       that work for most words, and you can then treat the exceptions as just
629       "irregular", at least relative to your ad hoc rules.  A simpler inflec‐
630       tion system (simpler rules, fewer forms) means that design decisions
631       are less crucial to maintaining sanity, whereas the same decisions
632       could incur overhead-versus-scalability problems in languages like Rus‐
633       sian.  It may also be likely that code (possibly in Perl, as with Lin‐
634       gua::EN::Inflect, for English nouns) has already been written for the
635       language in question, whether simple or complex.
636
       Moreover, a third possibility may even be simpler than anything
       discussed above: just require that all possible (or at least
       applicable) forms be provided in the call to the given language's
       quant method, as in:
641
642         "I found [quant,_1,file,files]."
643
       That way, quant just has to choose which form it needs, without having
       to look up or generate anything.  While possibly not optimal for
       Russian, this should work well for most other languages, where
       quantification is not as complicated an operation.
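
       The quant method that ships with Locale::Maketext works this way for
       English-like languages.  The sketch below (with hypothetical class
       names, and _AUTO so the keys double as their own shorthand) contrasts
       passing both forms with passing only the headword, in which case the
       stock method simply appends "s".

```perl
use strict;
use warnings;

package MyApp::L10N;            # hypothetical project base class
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package MyApp::L10N::en;        # hypothetical English lexicon
our @ISA = ('MyApp::L10N');
our %Lexicon = ( _AUTO => 1 );

package main;

my $lang = MyApp::L10N->get_handle('en')
    or die "Could not get a language handle";

# Both forms supplied: quant only has to pick one.
my $one  = $lang->maketext("I found [quant,_1,file,files].", 1);
my $many = $lang->maketext("I found [quant,_1,file,files].", 7);

# Headword only: the stock quant guesses the plural by adding "s".
my $guessed = $lang->maketext("I scanned [quant,_1,directory].", 4);

# Plural supplied explicitly: no guessing needed.
my $given = $lang->maketext("I scanned [quant,_1,directory,directories].", 4);

print "$_\n" for $one, $many, $guessed, $given;
```

       Supplying the forms in the call keeps irregular plurals in the hands
       of the translator, where they belong.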
648
649       The Devil in the Details
650
       There's plenty more to Maketext than described above -- for example,
       there's the details of how language tags ("en-US", "i-pwn", "fi", etc.)
       or locale IDs ("en_US") interact with actual module naming
       ("BogoQuery/Locale/en_us.pm"), and what magic can ensue; there's the
       details of how to record (and possibly negotiate) what character
       encoding Maketext will return text in (UTF8? Latin-1? KOI8?).  There's
       the interesting fact that Maketext is for localization, but has no
       "use locale;" anywhere in it.  For the curious, there's the somewhat
       frightening details of how I actually implement something like data
       inheritance so that searches across modules' %Lexicon hashes can
       parallel how Perl implements method inheritance.
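
       Two of those details are easy to observe from the outside: a language
       tag such as "en-GB" is mapped to a module name ending in "en_gb", and
       a key missing from one class's %Lexicon is looked up in the lexicons
       of its superclasses.  The class names and lexicon keys in this sketch
       are invented for the purpose.

```perl
use strict;
use warnings;

package MyApp::L10N;            # hypothetical project base class
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package MyApp::L10N::en;        # hypothetical general English lexicon
our @ISA = ('MyApp::L10N');
our %Lexicon = (
    'greeting'    => 'Hello.',
    'color_label' => 'Color',
);

package MyApp::L10N::en_gb;     # hypothetical British English lexicon
our @ISA = ('MyApp::L10N::en');
our %Lexicon = (
    'color_label' => 'Colour',  # overrides the entry inherited from en
);

package main;

# The tag "en-GB" selects the class whose name ends in "en_gb".
my $lang = MyApp::L10N->get_handle('en-GB')
    or die "Could not get a language handle";

my $class    = ref $lang;
my $label    = $lang->maketext('color_label');  # found in en_gb
my $greeting = $lang->maketext('greeting');     # found via en

print "$class\n$label\n$greeting\n";
```

       A regional class therefore only needs to list the entries where it
       differs from its parent.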
662
663       And, most importantly, there's all the practical details of how to
664       actually go about deriving from Maketext so you can use it for your
665       interfaces, and the various tools and conventions for starting out and
666       maintaining individual language modules.
667
       That is all covered in the documentation for Locale::Maketext and the
       modules that come with it, available in CPAN.  After having read this
       article, which covers the why's of Maketext, the documentation, which
       covers the how's of it, should be quite straightforward.
672
673       The Proof in the Pudding: Localizing Web Sites
674
675       Maketext and gettext have a notable difference: gettext is in C, acces‐
676       sible thru C library calls, whereas Maketext is in Perl, and really
677       can't work without a Perl interpreter (although I suppose something
678       like it could be written for C).  Accidents of history (and not neces‐
679       sarily lucky ones) have made C++ the most common language for the
680       implementation of applications like word processors, Web browsers, and
681       even many in-house applications like custom query systems.  Current
682       conditions make it somewhat unlikely that the next one of any of these
683       kinds of applications will be written in Perl, albeit clearly more for
684       reasons of custom and inertia than out of consideration of what is the
685       right tool for the job.
686
       However, other accidents of history have made Perl a well-accepted
       language for design of server-side programs (generally in CGI form) for
       Web site interfaces.  Localization of static pages in Web sites is
       trivial, feasible either with simple language-negotiation features in
       servers like Apache, or with some kind of server-side inclusions of
       language-appropriate text into layout templates.  However, I think that
       the localization of Perl-based search systems (or other kinds of
       dynamic content) in Web sites, be they public or access-restricted, is
       where Maketext will see the greatest use.
696
697       I presume that it would be only the exceptional Web site that gets
698       localized for English and Chinese and Italian and Arabic and Russian,
699       to recall the languages from the beginning of this article -- to say
700       nothing of German, Spanish, French, Japanese, Finnish, and Hindi, to
701       name a few languages that benefit from large numbers of programmers or
702       Web viewers or both.
703
704       However, the ever-increasing internationalization of the Web (whether
705       measured in terms of amount of content, of numbers of content writers
706       or programmers, or of size of content audiences) makes it increasingly
707       likely that the interface to the average Web-based dynamic content ser‐
708       vice will be localized for two or maybe three languages.  It is my hope
709       that Maketext will make that task as simple as possible, and will
710       remove previous barriers to localization for languages dissimilar to
711       English.
712
713        __END__
714
715       Sean M. Burke (sburke@cpan.org) has a Master's in linguistics from
716       Northwestern University; he specializes in language technology.  Jordan
717       Lachler (lachler@unm.edu) is a PhD student in the Department of Lin‐
718       guistics at the University of New Mexico; he specializes in morphology
719       and pedagogy of North American native languages.
720
721       References
722
723       Alvestrand, Harald Tveit.  1995.  RFC 1766: Tags for the Identification
724       of Languages.  "ftp://ftp.isi.edu/in-notes/rfc1766.txt" [Now see RFC
725       3066.]
726
727       Callon, Ross, editor.  1996.  RFC 1925: The Twelve Networking Truths.
728       "ftp://ftp.isi.edu/in-notes/rfc1925.txt"
729
730       Drepper, Ulrich, Peter Miller, and Francois Pinard.  1995-2001.  GNU
731       "gettext".  Available in "ftp://prep.ai.mit.edu/pub/gnu/", with exten‐
732       sive docs in the distribution tarball.  [Since I wrote this article in
733       1998, I now see that the gettext docs are now trying more to come to
734       terms with plurality.  Whether useful conclusions have come from it is
735       another question altogether. -- SMB, May 2001]
736
737       Forbes, Nevill.  1964.  Russian Grammar.  Third Edition, revised by J.
738       C. Dumbreck.  Oxford University Press.
739
740
741
742perl v5.8.8                       2001-09-21      Locale::Maketext::TPJ13(3pm)