PO4A-GETTEXTIZE(1p)               Po4a Tools               PO4A-GETTEXTIZE(1p)

NAME

       po4a-gettextize - convert an original file (and its translation) to a
       PO file

SYNOPSIS

       po4a-gettextize -f fmt -m master.doc [-l XX.doc] -p XX.po

       (XX.po is the output, all others are inputs)
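
       As an illustration of how the synopsis extends to several documents
       (-m and -l may each be given more than once, as described under
       OPTIONS), here is a minimal sketch that only assembles the argument
       list. The helper name gettextize_argv is hypothetical and not part of
       po4a; nothing is executed.

```python
import shlex


def gettextize_argv(fmt, masters, po, localized=()):
    """Assemble a po4a-gettextize command line (hypothetical helper).

    Only flags listed in this manual page are used: -f, -m, -l and -p.
    """
    argv = ["po4a-gettextize", "-f", fmt]
    for path in masters:      # one -m per master document
        argv += ["-m", path]
    for path in localized:    # one -l per translated document, if any
        argv += ["-l", path]
    argv += ["-p", po]        # the PO file is the only output
    return argv


# Print the command rather than running it.
print(shlex.join(gettextize_argv("man", ["master.doc"], "XX.po", ["XX.doc"])))
```

       The resulting list could be handed to a process launcher once the
       file names have been checked.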

DESCRIPTION

       po4a (PO for anything) eases the maintenance of documentation
       translation using the classical gettext tools. The main feature of po4a
       is that it decouples the translation of content from its document
       structure.  Please refer to the page po4a(7) for a gentle introduction
       to this project.

       The po4a-gettextize script is in charge of converting documentation
       files into PO files. You only need it to set up your translation
       project with po4a, never afterward.

       If you start from scratch, po4a-gettextize will extract the
       translatable strings from the documentation and write a POT file. If
       you provide a previously existing translated file with the -l flag,
       po4a-gettextize will try to use the translations that it contains in
       the produced PO file. This process remains tedious and manual, as
       explained in Section 'Converting a manual translation to po4a' below.

       If the master document has non-ASCII characters, the newly generated
       PO file will be in UTF-8. Otherwise (if the master document is
       completely in ASCII), the generated PO will use the encoding of the
       translated input document, or UTF-8 if no translated document is
       provided.
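
       The encoding rule in the last paragraph can be restated as a small
       decision function. This is only a sketch of the rule as worded above,
       not po4a's implementation; the name po_charset and its parameters are
       assumptions made for the example.

```python
def po_charset(master_bytes, localized_charset=None):
    """Return the charset the generated PO file would use (illustrative only).

    master_bytes:      raw content of the master document
    localized_charset: charset of the translated document, or None if no
                       translated document is provided
    """
    try:
        master_bytes.decode("ascii")
    except UnicodeDecodeError:
        # The master document contains non-ASCII characters.
        return "UTF-8"
    # Pure-ASCII master: follow the translation's charset when there is one.
    return localized_charset or "UTF-8"


print(po_charset(b"plain ASCII text"))
print(po_charset(b"plain ASCII text", "ISO-8859-1"))
print(po_charset("caf\u00e9".encode("utf-8"), "ISO-8859-1"))
```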

OPTIONS

       -f, --format
           Format of the documentation you want to handle. Use the
           --help-format option to see the list of available formats.

       -m, --master
           File containing the master document to translate. You can use this
           option multiple times if you want to gettextize multiple documents.

       -M, --master-charset
           Charset of the file containing the document to translate.

       -l, --localized
           File containing the localized (translated) document. If you
           provided multiple master files, you may wish to provide multiple
           localized files by using this option more than once.

       -L, --localized-charset
           Charset of the file containing the localized document.

       -p, --po
           File where the message catalog should be written. If not given, the
           message catalog will be written to the standard output.

       -o, --option
           Extra option(s) to pass to the format plugin. See the documentation
           of each plugin for more information about the valid options and
           their meanings. For example, you could pass '-o tablecells' to the
           AsciiDoc parser, while the text parser would accept '-o
           tabs=split'.

       -h, --help
           Show a short help message.

       --help-format
           List the documentation formats understood by po4a.

       -V, --version
           Display the version of the script and exit.

       -v, --verbose
           Increase the verbosity of the program.

       -d, --debug
           Output some debugging information.

       --msgid-bugs-address email@address
           Set the report address for msgid bugs. By default, the created POT
           files have no Report-Msgid-Bugs-To fields.

       --copyright-holder string
           Set the copyright holder in the POT header. The default value is
           "Free Software Foundation, Inc."

       --package-name string
           Set the package name for the POT header. The default is "PACKAGE".

       --package-version string
           Set the package version for the POT header. The default is
           "VERSION".

   Converting a manual translation to po4a
       po4a-gettextize will try to extract the content of any provided
       translation file, and use this content as msgstr in the produced PO
       file. Be warned that this process is very fragile: the Nth string of
       the translated file is supposed to be the translation of the Nth string
       in the original. This will naturally not work unless both files share
       exactly the same structure.

       Internally, each po4a parser reports the syntactical type of each
       extracted string. This is how desynchronizations are detected during
       the gettextization.  For example, if the files have the following
       structure, it is very unlikely that the 4th string in translation (of
       type 'chapter') is the translation of the 4th string in original (of
       type 'paragraph'). It is more likely that a new paragraph was added to
       the original, or that two original paragraphs were merged together in
       the translation.

           Original         Translation

         chapter            chapter
           paragraph          paragraph
           paragraph          paragraph
           paragraph        chapter
         chapter              paragraph
           paragraph          paragraph
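
       The positional check described above can be sketched as follows. This
       is a simplified model that assumes each parser yields a flat list of
       string types; the function name first_desync is hypothetical, and the
       real po4a code is not reproduced here.

```python
def first_desync(original_types, translation_types):
    """Return the 1-based position of the first type mismatch, or None.

    The Nth string of the translation is assumed to translate the Nth
    string of the original, so their syntactical types must agree.
    """
    for n, (otype, ttype) in enumerate(zip(original_types, translation_types), 1):
        if otype != ttype:
            return n
    if len(original_types) != len(translation_types):
        # One document simply has more strings than the other.
        return min(len(original_types), len(translation_types)) + 1
    return None


# The structure shown in the table above.
original = ["chapter", "paragraph", "paragraph", "paragraph", "chapter", "paragraph"]
translation = ["chapter", "paragraph", "paragraph", "chapter", "paragraph", "paragraph"]

print(first_desync(original, translation))  # 4: paragraph vs. chapter
```

       Stopping at the first mismatch mirrors the behavior described in this
       page: everything after that point is unreliable until the structures
       are realigned.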

       po4a-gettextize will verbosely diagnose any detected structure
       desynchronization. When this happens, you should manually edit the
       files (this probably requires that you have some notions of the target
       language). You must add fake paragraphs or remove some content in one
       of the documents (or both) to fix the reported disparities, until the
       structures of both documents match perfectly. Some tricks are given in
       the next section.

       Even when the document is successfully processed, undetected
       disparities and silent errors are still possible. That is why any
       translation associated automatically by po4a-gettextize is marked as
       fuzzy to require a manual inspection by humans. One has to check that
       each retrieved msgstr is actually the translation of the associated
       msgid, and not the string before or after.

       As you can see, the key here is to have the exact same structure in the
       translated document and in the original one. The best is to do the
       gettextization on the exact version of master.doc that was used for the
       translation, and only update the PO file against the latest master file
       once the gettextization has succeeded.

       If you are lucky enough to have a perfect match in the file
       structures, building a correct PO file is a matter of seconds.
       Otherwise, you will soon understand why this process has such an ugly
       name :) But remember that this grunt work is the price to pay to get
       the comfort of po4a afterward. Once converted, the synchronization
       between master documents and translations will always be fully
       automatic.

       Even when things go wrong, gettextization often remains faster than
       translating everything again. I was able to gettextize the existing
       French translation of the whole Perl documentation in one day, even
       though the structures of many documents were desynchronized. That was
       more than two megabytes of original text (2 million characters):
       restarting the translation from scratch would have required several
       months of work.

   Hints and tricks for the gettextization process
       The gettextization stops as soon as a desynchronization is detected. In
       theory, it should be possible to resynchronize the gettextization later
       in the documents using e.g. the same algorithm as the diff(1) utility.
       But manual intervention would still be required to match the elements
       that couldn't be matched automatically, which is why automatic
       resynchronization is not implemented (yet?).

       When this happens, the whole game comes down to realigning these damn
       files' structures through manual edits. po4a-gettextize is rather
       verbose about what went wrong when it happens. It reports the strings
       that don't match, their positions in the text, and the type of each of
       them. Moreover, the PO file generated so far is dumped as
       gettextization.failed.po for further inspection.

       Here are some other tricks to help you in this tedious process:

       •   Remove all extra content of the translations, such as the section
           giving credits to the translators. You can add them back in po4a
           afterward, using an addendum (see po4a(7)).

       •   If you need to edit the files to align their structures, you should
           prefer editing the translation if possible. Indeed, if the changes
           to the original are too intrusive, the old and new versions will
           not be matched during the PO update, and the corresponding
           translation will be dumped anyway. But do not hesitate to also edit
           the original document if required: the important thing is to get a
           first PO file to start with.

       •   Do not hesitate to kill any original content that would not exist
           in the translated version. This content will be automatically
           reintroduced afterward, when synchronizing the PO file with the
           document.

       •   You should probably inform the original author of any structural
           change in the translation that seems justified. Issues in the
           original document should be reported to the author. Fixing them in
           your translation only fixes them for a part of the community. Plus,
           it is impossible to do so when using po4a ;)

       •   Sometimes, the paragraph contents match, but not their types.
           Fixing it is rather format-dependent. In POD and man, it often
           comes from the fact that one of them contains a line beginning with
           a white space while the other does not.  In those formats, such
           paragraphs cannot be wrapped and thus become a different type. Just
           remove the space and you are fine. It may also be a typo in the tag
           name in XML.

           Likewise, two paragraphs may get merged together in POD when the
           separating line contains some spaces, or when there is no empty
           line between the =item line and the content of the item.

       •   Sometimes, the desynchronization message seems odd because the
           translation is attached to the wrong original paragraph. It is the
           sign of an undetected issue earlier in the process. Search for the
           actual desynchronization point by inspecting
           gettextization.failed.po, and fix the problem where it really is.

       •   In some unfortunate settings, you will get the feeling that po4a
           ate some parts of the text, either the original or the translation.
           gettextization.failed.po indicates that both files matched as
           expected up to paragraph N. But then, an (unsuccessful) attempt is
           made to match paragraph N+1 in the original file not with paragraph
           N+1 in the translation as it should, but with paragraph N+2. Just
           as if the paragraph N+1 that you see in the document simply
           disappeared from the file during the process.

           This unfortunate situation happens when the same paragraph is
           repeated over the document. In that case, no new entry is created
           in the PO file, but a new reference is added to the existing one
           instead.

           So, the previous situation occurs when two similar but different
           paragraphs are translated in the exact same way. This will
           apparently remove a paragraph of the translation. To fix the
           problem, it is sufficient to slightly alter one of the translations
           in the document. You can also prefer to kill the second paragraph
           in the original document.

           Conversely, if the same paragraph appearing twice in the original
           document is not translated in the exact same way at both locations,
           you will get the feeling that one paragraph of the original
           document just vanished. Just copy the best translation over the
           other one in the translated document to fix the problem.

       •   As a final note, do not be too surprised if the first
           synchronization of your PO file takes a long time. This is because
           most of the msgids of the PO file resulting from the gettextization
           don't exactly match any element of the POT file built from the
           recent master files. This forces gettext to search for the closest
           one using a costly string proximity algorithm.

           For example, the first po4a-updatepo of the Perl documentation's
           French translation (5.5 MB PO file) took about 48 hours (yes, two
           days) while the subsequent ones only take a dozen seconds.
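
       To give a feel for why that first synchronization is slow, the sketch
       below looks up the closest new msgid for one old msgid. It uses
       Python's difflib as a stand-in; gettext uses its own proximity
       algorithm, which is not reproduced here, and the function name
       closest_msgid is hypothetical.

```python
import difflib


def closest_msgid(old_msgid, new_msgids, cutoff=0.6):
    """Return the new msgid most similar to old_msgid, or None.

    Every old msgid without an exact match has to be compared against the
    candidates, so the total cost grows with the product of both counts.
    """
    matches = difflib.get_close_matches(old_msgid, new_msgids, n=1, cutoff=cutoff)
    return matches[0] if matches else None


new_msgids = ["The quick brown fox jumps.", "Unrelated text"]
print(closest_msgid("The quick brown fox.", new_msgids))
print(closest_msgid("abc", ["xyz"]))
```

       Once the PO file has been updated, most msgids match exactly and this
       search is no longer needed for them, which is consistent with the much
       shorter later runs mentioned above.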
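
       The "vanishing paragraph" situation described in the hints above can
       be reproduced with a toy model in which a PO catalog keeps a single
       entry per distinct string. This is a simplified sketch, assuming plain
       lists of paragraphs; the helper unique_in_order and the sample strings
       are made up for the example.

```python
def unique_in_order(strings):
    """Keep the first occurrence of each string, as a PO catalog would."""
    seen = set()
    result = []
    for s in strings:
        if s not in seen:
            seen.add(s)
            result.append(s)
    return result


# Two different original paragraphs were translated in the exact same way.
original = ["Title", "See the manual.", "Body", "See the FAQ.", "End"]
translation = ["Titre", "Voir la doc.", "Corps", "Voir la doc.", "Fin"]

pairs = list(zip(unique_in_order(original), unique_in_order(translation)))
for msgid, msgstr in pairs:
    print(f"{msgid!r} <- {msgstr!r}")
```

       In this model the fourth original paragraph ends up paired with the
       fifth translated one, as if the fourth translated paragraph had been
       eaten. Slightly altering one of the two identical translations
       restores the one-to-one pairing.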

SEE ALSO

       po4a(1), po4a-normalize(1), po4a-translate(1), po4a-updatepo(1),
       po4a(7).

AUTHORS

        Denis Barbier <barbier@linuxfr.org>
        Nicolas Francois <nicolas.francois@centraliens.net>
        Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE

       Copyright 2002-2020 by SPI, inc.

       This program is free software; you may redistribute it and/or modify it
       under the terms of GPL (see the COPYING file).

Po4a Tools                        2022-01-21               PO4A-GETTEXTIZE(1p)