PO4A-GETTEXTIZE.1P(1)    User Contributed Perl Documentation    PO4A-GETTEXTIZE.1P(1)

NAME
po4a-gettextize - convert an original file (and its translation) to a
PO file

SYNOPSIS
po4a-gettextize -f fmt -m master.doc -l XX.doc -p XX.po

(XX.po is the output, all others are inputs)

DESCRIPTION
po4a (PO for anything) eases the maintenance of documentation
translation using the classical gettext tools. The main feature of po4a
is that it decouples the translation of content from its document
structure. Please refer to the page po4a(7) for a gentle introduction
to this project.

The po4a-gettextize script helps you convert your previously
existing translations into a po4a-based workflow. This is only to be
done once to salvage an existing translation while converting to po4a,
not on a regular basis after the conversion of your project. This
tedious process is explained in detail in the section 'Converting a
manual translation to po4a' below.

You must provide both a master file (e.g., the source in English) and
an existing translated file (e.g., a previous translation attempt
without po4a). If you provide more than one master or translation
file, they will be used in sequence, but it may be easier to
gettextize each page or chapter separately and then use msgmerge to
merge all produced PO files. As you wish.
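The per-chapter approach mentioned above can be scripted. The following Python sketch only builds the po4a-gettextize command lines, using the -f, -m, -l and -p options documented in this page. The chapter file names, the 'text' format and the one-PO-file-per-chapter naming are assumptions made for illustration, and the final merge of the resulting PO files is left out.

```python
from pathlib import Path


def gettextize_commands(fmt, pairs, po_dir):
    """Build one po4a-gettextize command line per (master, localized) pair.

    Only the -f, -m, -l and -p options described in this page are used.
    Nothing is executed here; each list can be passed to subprocess.run().
    """
    commands = []
    for master, localized in pairs:
        # One PO file per chapter, named after the master file (an assumption).
        po_file = Path(po_dir) / (Path(master).stem + ".po")
        commands.append([
            "po4a-gettextize",
            "-f", fmt,
            "-m", str(master),
            "-l", str(localized),
            "-p", str(po_file),
        ])
    return commands


# Hypothetical chapter files, for illustration only.
chapters = [("en/intro.txt", "fr/intro.txt"), ("en/usage.txt", "fr/usage.txt")]
for command in gettextize_commands("text", chapters, "po"):
    print(" ".join(command))
```

Keeping one command per chapter means that a desynchronization in one chapter does not block the others.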

If the master document has non-ASCII characters, the newly generated PO
file will be in UTF-8. If the master document is completely in ASCII,
the generated PO will use the encoding of the translated input
document.
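The encoding rule above can be restated as a small Python sketch. The function name and its parameters are hypothetical; this illustrates the rule as described here, not po4a's actual code.

```python
def output_charset(master_text, localized_charset):
    """Pick the charset of the generated PO file, following the rule above."""
    # str.isascii() is True only when every character is ASCII (Python 3.7+).
    if master_text.isascii():
        return localized_charset
    return "UTF-8"


print(output_charset("Plain ASCII master", "ISO-8859-15"))
print(output_charset("Master with a caf\u00e9", "ISO-8859-15"))
```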

OPTIONS
-f, --format
    Format of the documentation you want to handle. Use the
    --help-format option to see the list of available formats.

-m, --master
    File containing the master document to translate. You can use this
    option multiple times if you want to gettextize multiple documents.

-M, --master-charset
    Charset of the file containing the document to translate.

-l, --localized
    File containing the localized (translated) document. If you
    provided multiple master files, you may wish to provide multiple
    localized files by using this option more than once.

-L, --localized-charset
    Charset of the file containing the localized document.

-p, --po
    File where the message catalog should be written. If not given, the
    message catalog will be written to the standard output.

-o, --option
    Extra option(s) to pass to the format plugin. See the documentation
    of each plugin for more information about the valid options and
    their meanings. For example, you could pass '-o tablecells' to the
    AsciiDoc parser, while the text parser would accept
    '-o tabs=split'.

-h, --help
    Show a short help message.

--help-format
    List the documentation formats understood by po4a.

-k, --keep-temps
    Keep the temporary master and localized POT files built before
    merging. This can be useful to understand why these files get
    desynchronized, leading to gettextization problems.

-V, --version
    Display the version of the script and exit.

-v, --verbose
    Increase the verbosity of the program.

-d, --debug
    Output some debugging information.

--msgid-bugs-address email@address
    Set the report address for msgid bugs. By default, the created POT
    files have no Report-Msgid-Bugs-To fields.

--copyright-holder string
    Set the copyright holder in the POT header. The default value is
    "Free Software Foundation, Inc."

--package-name string
    Set the package name for the POT header. The default is "PACKAGE".

--package-version string
    Set the package version for the POT header. The default is
    "VERSION".

Converting a manual translation to po4a
po4a-gettextize synchronizes the master and localized files to extract
their content into a PO file. The content of the master file gives the
msgid while the content of the localized file gives the msgstr. This
process is somewhat fragile: the Nth string of the translated file is
supposed to be the translation of the Nth string in the original.

Gettextization works best if you manage to retrieve the exact version
of the original document that was used for translation. Even so, you
may need to fiddle with both master and localized files to align their
structure if it was changed by the original translator, so working on
copies of the files is advised.

Internally, each po4a parser reports the syntactical type of each
extracted string. This is how desynchronizations are detected during
the gettextization. In the example depicted below, it is very unlikely
that the 4th string in the translation (of type 'chapter') is the
translation of the 4th string in the original (of type 'paragraph'). It
is more likely that a new paragraph was added to the original, or that
two original paragraphs were merged together in the translation.

    Original        Translation

    chapter         chapter
    paragraph       paragraph
    paragraph       paragraph
    paragraph       chapter
    chapter         paragraph
    paragraph       paragraph
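The positional pairing and the type check described above can be sketched in Python as follows. This is a simplified model, not po4a's implementation: the (type, text) tuples and the function name are assumptions, and the texts are placeholders.

```python
def pair_strings(original, translation):
    """Pair the Nth original string with the Nth translated string.

    Each input is a list of (type, text) tuples in document order.
    Returns (pairs, desync_index); desync_index is None when the whole
    structure matches, else the 0-based position of the first mismatch.
    """
    pairs = []
    for index, ((orig_type, msgid), (trans_type, msgstr)) in enumerate(
        zip(original, translation)
    ):
        if orig_type != trans_type:
            return pairs, index  # stop at the first desynchronization
        pairs.append((msgid, msgstr))
    if len(original) != len(translation):
        # One document has extra strings at the end.
        return pairs, min(len(original), len(translation))
    return pairs, None


# The structure shown in the table above, with placeholder texts.
original_types = ["chapter", "paragraph", "paragraph",
                  "paragraph", "chapter", "paragraph"]
translation_types = ["chapter", "paragraph", "paragraph",
                     "chapter", "paragraph", "paragraph"]
original = [(t, f"original {i}") for i, t in enumerate(original_types, 1)]
translation = [(t, f"translation {i}") for i, t in enumerate(translation_types, 1)]

pairs, desync = pair_strings(original, translation)
print(len(pairs), desync)  # three pairs built, mismatch at index 3 (the 4th string)
```

Note that the check only compares types. Two paragraphs that are unrelated but of the same type are paired silently, which is why the resulting PO file still needs a manual review.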

po4a-gettextize will verbosely diagnose any structure
desynchronization. When this happens, you should manually edit the
files to add fake paragraphs or remove some content here and there
until the structures of both files actually match. Some tricks are
given below to salvage as much as possible of the existing translation
while doing so.

If you are lucky enough to have a perfect match in the file structures
out of the box, building a correct PO file is a matter of seconds.
Otherwise, you will soon understand why this process has such an ugly
name :) Even so, gettextization often remains faster than translating
everything again. I gettextized the French translation of the whole
Perl documentation in one day despite the many synchronization issues.
Given the amount of text (2 MB of original text), restarting the
translation without first salvaging the old translations would have
required several months of work. In addition, this grunt work is the
price to pay to get the comfort of po4a. Once converted, the
synchronization between master documents and translations will always
be fully automatic.

After a successful gettextization, the produced documents should be
manually checked for undetected disparities and silent errors, as
explained below.

Hints and tricks for the gettextization process

The gettextization stops as soon as a desynchronization is detected.
When this happens, you need to edit the files as much as needed to
realign their structures. po4a-gettextize is rather verbose when
things go wrong. It reports the strings that don't match, their
positions in the text, and the type of each of them. Moreover, the PO
file generated so far is dumped as gettextization.failed.po for further
inspection.

Here are some tricks to help you in this tedious process and ensure
that you salvage as much as possible of the previous translation:

• Remove all extra content of the translations, such as the section
  giving credits to the translators. It should be added separately
  to po4a as addenda (see po4a(7)).

• When editing the files to align their structures, prefer editing
  the translation if possible. Indeed, if the changes to the original
  are too intrusive, the old and new versions will not be matched
  during the first po4a run after gettextization (see below). Any
  unmatched translation will be discarded anyway. That being said, you
  still want to edit the original document if it's too hard to get
  the gettextization to proceed otherwise, even if it means that one
  paragraph of the translation is discarded. The important thing is to
  get a first PO file to start with.

• Do not hesitate to remove any original content that does not exist
  in the translated version. This content will be automatically
  reintroduced afterward, when synchronizing the PO file with the
  document.

• You should probably inform the original author of any structural
  change in the translation that seems justified. Issues in the
  original document should be reported to the author. Fixing them in
  your translation only fixes them for a part of the community. Plus,
  it is impossible to do so when using po4a ;) But you probably want
  to wait until the end of the conversion to po4a before changing the
  original files.

• Sometimes, the paragraph contents do match, but not their types.
  Fixing it is rather format-dependent. In POD and man, it often
  comes from the fact that one of them contains a line beginning with
  a white space while the other does not. In those formats, such a
  paragraph cannot be wrapped and thus becomes a different type. Just
  remove the space and you are fine. It may also be a typo in the tag
  name in XML.

  Likewise, two paragraphs may get merged together in POD when the
  separating line contains some spaces, or when there is no empty
  line between the =item line and the content of the item.

• Sometimes, the desynchronization message seems odd because the
  translation is attached to the wrong original paragraph. It is the
  sign of an undetected issue earlier in the process. Search for the
  actual desynchronization point by inspecting the file
  gettextization.failed.po that was produced, and fix the problem
  where it really is.

• Other issues may come from duplicated strings in either the
  original or the translation. Duplicated strings are merged in PO
  files, with two references. This constitutes a difficulty for the
  gettextization algorithm, which is a simple one-to-one pairing
  between the msgids of both the master and the localized files. It
  is however believed that recent versions of po4a deal properly with
  duplicated strings, so you should report any remaining issue that
  you may encounter.
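The whitespace-related hint in the list above (a line beginning with a white space, or a separating line that contains only spaces) lends itself to a quick mechanical check before running po4a-gettextize. The Python sketch below is a hypothetical helper, not part of po4a. Not every hit is a problem, since verbatim blocks legitimately begin with white space; what matters is a hit that appears in one file but not at the matching place in the other.

```python
def whitespace_suspects(text):
    """Return (line_number, reason) for lines that often break POD/man alignment."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line and not line.strip():
            # Looks empty but is not: POD does not treat it as a separator.
            suspects.append((number, "separator line contains only whitespace"))
        elif line[:1] in (" ", "\t"):
            suspects.append((number, "line begins with whitespace"))
    return suspects


sample = "=head1 NAME\n\nfoo - do things\n \n second paragraph\n"
for number, reason in whitespace_suspects(sample):
    print(number, reason)
```

Running it on both the master and the localized file and comparing the two reports points at the paragraphs whose type is likely to differ.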

Reviewing files produced by po4a-gettextize
Any file produced by po4a-gettextize should be manually reviewed, even
when the script terminates successfully. You should skim over the PO
file, ensuring that each msgid and its msgstr actually match. It is not
necessary to ensure that the translation is perfectly correct yet, as
all entries are marked as fuzzy translations anyway. You only need to
check for obvious matching issues, because badly matched translations
will be discarded in subsequent steps while you want to salvage them.

Fortunately, this step does not require mastering the target languages,
as you only want to recognize similar elements in each msgid and its
corresponding msgstr. As a speaker of French, English, and some German
myself, I can do this for all European languages at least, even if I
cannot say one word of most of these languages. I sometimes manage to
detect matching issues in non-Latin languages by looking at string
length, phrase structures (does the number of question marks
match?) and other clues, but I prefer when someone else can review
those languages.
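The clues mentioned above (string length, number of question marks) can be turned into a rough pre-filter that lists the entries deserving a closer look. The Python sketch below works on plain (msgid, msgstr) pairs. The function name, the length-ratio threshold and the set of question mark characters are assumptions chosen for illustration, and the sample entries are invented.

```python
# ASCII, full-width and Arabic question marks.
QUESTION_MARKS = "?\uff1f\u061f"


def suspicious_pairs(entries, max_ratio=3.0):
    """Flag (msgid, msgstr) pairs that look badly matched.

    max_ratio is an arbitrary threshold chosen for this sketch.
    """
    flagged = []
    for msgid, msgstr in entries:
        if not msgid or not msgstr:
            continue  # nothing to compare
        id_marks = sum(msgid.count(mark) for mark in QUESTION_MARKS)
        str_marks = sum(msgstr.count(mark) for mark in QUESTION_MARKS)
        ratio = max(len(msgid), len(msgstr)) / min(len(msgid), len(msgstr))
        if id_marks != str_marks:
            flagged.append((msgid, "question mark count differs"))
        elif ratio > max_ratio:
            flagged.append((msgid, "lengths differ a lot"))
    return flagged


entries = [
    ("Is it ready?", "Est-ce pr\u00eat ?"),
    ("Installation", "Ce paragraphe d\u00e9crit en d\u00e9tail la proc\u00e9dure d'installation."),
    ("Do you agree?", "Introduction"),
]
for msgid, reason in suspicious_pairs(entries):
    print(f"{msgid!r}: {reason}")
```

Such a filter only narrows down where to look; it does not replace skimming the PO file.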

If you detect a mismatch, edit the original and translation files as if
po4a-gettextize reported an error, and try again. Once you have a
decent PO file for your previous translation, back it up until you get
po4a working correctly.

Running po4a for the first time
The easiest way to set up po4a is to write a po4a.conf configuration
file, and use the integrated po4a program (po4a-updatepo and
po4a-translate are deprecated). Please check the "CONFIGURATION FILE"
section in the po4a(1) documentation for more details.

When po4a runs for the first time, the current version of the master
documents will be used to update the PO files containing the old
translations that you salvaged through gettextization. This can take
quite a long time, because many of the msgids from the
gettextization do not exactly match the elements of the POT file built
from the recent master files. This forces gettext to search for the
closest one using a costly string proximity algorithm. For example,
the first run over the Perl documentation's French translation (5.5 MB
PO file) took about 48 hours (yes, two days) while the subsequent ones
only take seconds.

Moving your translations to production
After this first run, the PO files are ready to be reviewed by
translators. All entries were marked as fuzzy in the PO file by
po4a-gettextize, forcing their careful review before use.
Translators should take each entry, verify that the salvaged
translation actually matches the current original text, update the
translation as needed, and remove the fuzzy markers.

Once enough fuzzy markers are removed, po4a will start generating the
translation files on disk, and you're ready to move your translation
workflow to production. Some projects find it useful to rely on Weblate
to coordinate between translators and maintainers, but that's beyond
po4a's scope.
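To follow the review progress described above, the remaining fuzzy entries of a PO file can be counted. The Python sketch below is a rough, line-based count and not a full PO parser: it assumes that entries are separated by empty lines, that flags appear on '#,' comment lines, and that the header is the entry with an empty msgid. The function name and the sample catalog are made up for illustration.

```python
def fuzzy_stats(po_text):
    """Return (total_entries, fuzzy_entries) for the text of a PO file."""
    total = fuzzy = 0
    for block in po_text.strip().split("\n\n"):
        msgid_parts = []
        in_msgid = False
        is_fuzzy = False
        for line in block.splitlines():
            if line.startswith("#,"):
                is_fuzzy = is_fuzzy or "fuzzy" in line
            elif line.startswith("msgid "):
                in_msgid = True
                msgid_parts.append(line[len("msgid "):])
            elif in_msgid and line.startswith('"'):
                msgid_parts.append(line)  # continuation of a multi-line msgid
            else:
                in_msgid = False
        if not msgid_parts or all(part == '""' for part in msgid_parts):
            continue  # skip the header entry and comment-only blocks
        total += 1
        fuzzy += is_fuzzy
    return total, fuzzy


sample = '''msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

#. type: textblock
#, fuzzy
msgid "Hello"
msgstr "Bonjour"

#. type: textblock
msgid "Goodbye"
msgstr "Au revoir"
'''
total, fuzzy = fuzzy_stats(sample)
print(f"{fuzzy} of {total} entries are still fuzzy")
```

How many entries must be reviewed before po4a writes a translated file is a po4a setting that this page does not describe, so the sketch only reports the counts.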

SEE ALSO
po4a(1), po4a-normalize(1), po4a-translate(1), po4a-updatepo(1),
po4a(7).

AUTHORS
Denis Barbier <barbier@linuxfr.org>
Nicolas François <nicolas.francois@centraliens.net>
Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE
Copyright 2002-2022 by SPI, inc.

This program is free software; you may redistribute it and/or modify it
under the terms of GPL (see the COPYING file).

perl v5.36.0    2023-01-23    PO4A-GETTEXTIZE.1P(1)