PO4A-GETTEXTIZE(1p)              Po4a Tools              PO4A-GETTEXTIZE(1p)

NAME
po4a-gettextize - convert an original file (and its translation) to a
PO file

SYNOPSIS
po4a-gettextize -f fmt -m master.doc [-l XX.doc] -p XX.po

(XX.po is the output, all others are inputs)

DESCRIPTION
po4a (PO for anything) eases the maintenance of documentation
translation using the classical gettext tools. The main feature of po4a
is that it decouples the translation of content from its document
structure. Please refer to the page po4a(7) for a gentle introduction
to this project.

The po4a-gettextize script is in charge of converting documentation
files into PO files. You only need it to set up your translation
project with po4a, never afterward.

If you start from scratch, po4a-gettextize will extract the
translatable strings from the documentation and write a POT file. If
you provide a previously existing translated file with the -l flag,
po4a-gettextize will try to use the translations that it contains in
the produced PO file. This process remains tedious and manual, as
explained in the section 'Converting a manual translation to po4a'
below.

If the master document has non-ASCII characters, the newly generated PO
file will be in UTF-8. Otherwise (if the master document is completely
in ASCII), the generated PO will use the encoding of the translated
input document, or UTF-8 if no translated document is provided.

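The charset rule above can be restated as a small decision function.
The following Python sketch is illustrative only: the function name
choose_po_charset and its arguments are assumptions made for this
example, not part of po4a.

```python
def choose_po_charset(master_text, localized_charset=None):
    """Return the charset of the generated PO file, per the rule above."""
    # str.isascii() is True only if every character is in the ASCII range.
    if not master_text.isascii():
        return "UTF-8"
    # Pure-ASCII master: reuse the translation's charset when one is given.
    return localized_charset or "UTF-8"


print(choose_po_charset("R\u00e9sum\u00e9", "ISO-8859-1"))
print(choose_po_charset("Plain text", "ISO-8859-1"))
print(choose_po_charset("Plain text"))
```

Note that a non-ASCII master wins over the charset of the translation:
only a pure-ASCII master lets the translated document decide.
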
OPTIONS
-f, --format
    Format of the documentation you want to handle. Use the
    --help-format option to see the list of available formats.

-m, --master
    File containing the master document to translate. You can use this
    option multiple times if you want to gettextize multiple documents.

-M, --master-charset
    Charset of the file containing the document to translate.

-l, --localized
    File containing the localized (translated) document. If you
    provided multiple master files, you may wish to provide multiple
    localized files by using this option more than once.

-L, --localized-charset
    Charset of the file containing the localized document.

-p, --po
    File where the message catalog should be written. If not given, the
    message catalog will be written to the standard output.

-o, --option
    Extra option(s) to pass to the format plugin. See the documentation
    of each plugin for more information about the valid options and
    their meanings. For example, you could pass '-o tablecells' to the
    AsciiDoc parser, while the text parser would accept '-o
    tabs=split'.

-h, --help
    Show a short help message.

--help-format
    List the documentation formats understood by po4a.

-V, --version
    Display the version of the script and exit.

-v, --verbose
    Increase the verbosity of the program.

-d, --debug
    Output some debugging information.

--msgid-bugs-address email@address
    Set the report address for msgid bugs. By default, the created POT
    files have no Report-Msgid-Bugs-To fields.

--copyright-holder string
    Set the copyright holder in the POT header. The default value is
    "Free Software Foundation, Inc."

--package-name string
    Set the package name for the POT header. The default is "PACKAGE".

--package-version string
    Set the package version for the POT header. The default is
    "VERSION".

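As an illustration of how the -f, -m, -l and -p options combine when
several documents are gettextized at once, the Python sketch below
assembles a command line without running it. The helper name
build_gettextize_command and the file names are hypothetical, and "pod"
is only assumed to be an accepted format name (check --help-format).

```python
import shlex


def build_gettextize_command(fmt, masters, localized=(), po_path=None):
    """Assemble a po4a-gettextize argument list from the documented flags."""
    cmd = ["po4a-gettextize", "-f", fmt]
    for master in masters:
        cmd += ["-m", master]      # -m may be repeated, once per master
    for translated in localized:
        cmd += ["-l", translated]  # -l may be repeated in the same way
    if po_path is not None:
        cmd += ["-p", po_path]     # without -p, the catalog goes to stdout
    return cmd


cmd = build_gettextize_command(
    "pod",
    ["intro.pod", "usage.pod"],
    ["intro.fr.pod", "usage.fr.pod"],
    "fr.po",
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run() once the files
exist. The flags are emitted in the order used by the synopsis.
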
Converting a manual translation to po4a
po4a-gettextize will try to extract the content of any provided
translation file, and use this content as msgstr in the produced PO
file. Be warned that this process is very fragile: the Nth string of
the translated file is assumed to be the translation of the Nth string
in the original. This will naturally not work unless both files share
exactly the same structure.

Internally, each po4a parser reports the syntactical type of each
extracted string. This is how desynchronizations are detected during
the gettextization. For example, if the files have the following
structure, it is very unlikely that the 4th string in the translation
(of type 'chapter') is the translation of the 4th string in the
original (of type 'paragraph'). It is more likely that a new paragraph
was added to the original, or that two original paragraphs were merged
together in the translation.

  Original           Translation

chapter            chapter
  paragraph          paragraph
  paragraph          paragraph
  paragraph        chapter
chapter              paragraph
  paragraph          paragraph

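The positional check described above can be sketched in a few lines of
Python. This is a simplified model, not po4a's actual code: the function
name first_desync is hypothetical, and treating a difference in length
as a desynchronization is an assumption of this sketch.

```python
def first_desync(original_types, translated_types):
    """Return the 1-based position of the first type mismatch, or None."""
    pairs = zip(original_types, translated_types)
    for position, (orig, trans) in enumerate(pairs, start=1):
        if orig != trans:
            return position
    # All compared types agree; a leftover string on either side is
    # reported as a mismatch too (assumption of this sketch).
    if len(original_types) != len(translated_types):
        return min(len(original_types), len(translated_types)) + 1
    return None


original = ["chapter", "paragraph", "paragraph",
            "paragraph", "chapter", "paragraph"]
translation = ["chapter", "paragraph", "paragraph",
               "chapter", "paragraph", "paragraph"]
print(first_desync(original, translation))  # 4
```

With the structure shown in the table, the first three strings agree and
the comparison stops at the 4th one.
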
po4a-gettextize will verbosely diagnose any detected structure
desynchronization. When this happens, you should manually edit the
files (this probably requires some knowledge of the target language).
You must add fake paragraphs or remove some content in one of the
documents (or both) to fix the reported disparities, until the
structures of both documents match perfectly. Some tricks are given in
the next section.

Even when the document is successfully processed, undetected
disparities and silent errors are still possible. That is why any
translation associated automatically by po4a-gettextize is marked as
fuzzy to require a manual inspection by humans. One has to check that
each retrieved msgstr is actually the translation of the associated
msgid, and not the string before or after.

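For readers unfamiliar with the PO format, a fuzzy entry is an ordinary
msgid/msgstr pair preceded by a "#, fuzzy" flag line. The Python sketch
below prints one such entry. The helper names are hypothetical, the
French string is only an illustrative translation, and real PO files
also carry references and line wrapping that are omitted here.

```python
def po_quote(text):
    """Escape a string for use inside a double-quoted PO literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def fuzzy_entry(msgid, msgstr):
    """Format one PO entry flagged as fuzzy, i.e. awaiting human review."""
    return "\n".join([
        "#, fuzzy",
        "msgid " + po_quote(msgid),
        "msgstr " + po_quote(msgstr),
    ])


print(fuzzy_entry("Show a short help message.",
                  "Afficher un court message d'aide."))
```

Removing the "#, fuzzy" line, usually through a PO editor, is how a
translator records that the pair has been checked.
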
As you can see, the key here is to have the exact same structure in the
translated document and in the original one. It is best to run the
gettextization on the exact version of master.doc that was used for the
translation, and to update the PO file against the latest master file
only once the gettextization has succeeded.

If you are lucky enough to have a perfect match in the file
structures, building a correct PO file is a matter of seconds.
Otherwise, you will soon understand why this process has such an ugly
name :) But remember that this grunt work is the price to pay to get
the comfort of po4a afterward. Once converted, the synchronization
between master documents and translations will always be fully
automatic.

Even when things go wrong, gettextization often remains faster than
translating everything again. I was able to gettextize the existing
French translation of the whole Perl documentation in one day, even
though the structure of many documents was desynchronized. That was
more than two megabytes of original text (2 million characters):
restarting the translation from scratch would have required several
months of work.

Hints and tricks for the gettextization process
The gettextization stops as soon as a desynchronization is detected. In
theory, it should be possible to resynchronize the gettextization later
in the documents using, e.g., the same algorithm as the diff(1)
utility. But manual intervention would still be required to match the
elements that could not be matched automatically, which explains why
automatic resynchronization is not implemented (yet?).

When this happens, the whole game comes down to realigning these damn
files' structures through manual edits. po4a-gettextize is rather
verbose about what went wrong when it happens. It reports the strings
that don't match, their positions in the text, and the type of each of
them. Moreover, the PO file generated so far is dumped as
gettextization.failed.po for further inspection.

Here are some other tricks to help you in this tedious process:

• Remove all extra content of the translations, such as the section
  giving credits to the translators. You can add them back in po4a
  afterward, using an addendum (see po4a(7)).

• If you need to edit the files to align their structures, you should
  prefer editing the translation if possible. Indeed, if the changes
  to the original are too intrusive, the old and new versions will
  not be matched during the PO update, and the corresponding
  translation will be discarded anyway. But do not hesitate to also
  edit the original document if required: the important thing is to
  get a first PO file to start with.

• Do not hesitate to remove any original content that does not exist
  in the translated version. This content will be automatically
  reintroduced afterward, when synchronizing the PO file with the
  document.

• You should probably inform the original author of any structural
  change in the translation that seems justified. Issues in the
  original document should be reported to the author. Fixing them in
  your translation only fixes them for a part of the community. Plus,
  it is impossible to do so when using po4a ;)

• Sometimes, the contents of the paragraphs match, but their types do
  not. Fixing it is rather format-dependent. In POD and man, it often
  comes from the fact that one of them contains a line beginning with
  a white space while the other does not. In those formats, such
  paragraphs cannot be wrapped and thus become a different type. Just
  remove the space and you are fine. It may also be a typo in the tag
  name in XML.

  Likewise, two paragraphs may get merged together in POD when the
  separating line contains some spaces, or when there is no empty
  line between the =item line and the content of the item.

• Sometimes, the desynchronization message seems odd because the
  translation is attached to the wrong original paragraph. It is the
  sign of an undetected issue earlier in the process. Search for the
  actual desynchronization point by inspecting
  gettextization.failed.po, and fix the problem where it really is.

• In some unfortunate settings, you will get the feeling that po4a
  ate some parts of the text, either the original or the translation.
  gettextization.failed.po indicates that both files matched as
  expected up to paragraph N. But then, an (unsuccessful) attempt is
  made to match paragraph N+1 of the original file not with paragraph
  N+1 of the translation as it should, but with paragraph N+2. Just
  as if the paragraph N+1 that you see in the document simply
  disappeared from the file during the process.

  This unfortunate situation happens when the same paragraph is
  repeated over the document. In that case, no new entry is created
  in the PO file, but a new reference is added to the existing one
  instead.

  So, the previous situation occurs when two similar but different
  paragraphs are translated in the exact same way. This will
  apparently remove a paragraph of the translation. To fix the
  problem, it is sufficient to slightly alter one of the translations
  in the document. Alternatively, you can remove the second paragraph
  in the original document.

  Conversely, if the same paragraph appearing twice in the original
  document is not translated in the exact same way at both locations,
  you will get the feeling that one paragraph of the original
  document just vanished. Just copy the best translation over the
  other one in the translated document to fix the problem.

• As a final note, do not be too surprised if the first
  synchronization of your PO file takes a long time. This is because
  most of the msgids of the PO file resulting from the gettextization
  do not exactly match any element of the POT file built from the
  recent master files. This forces gettext to search for the closest
  one using a costly string proximity algorithm.

  For example, the first po4a-updatepo of the Perl documentation's
  French translation (5.5 MB PO file) took about 48 hours (yes, two
  days) while the subsequent ones only take about a dozen seconds.

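The hint about repeated paragraphs is easier to follow with a toy model.
The Python sketch below assumes, as a simplification of what the text
describes, that each document is reduced to its list of distinct
paragraphs before the two lists are paired by position. The function
name catalog_entries and the sample strings are hypothetical.

```python
def catalog_entries(paragraphs):
    """Collapse repeated paragraphs into one entry, in first-seen order."""
    # dict.fromkeys() keeps insertion order and drops later duplicates,
    # mimicking "a new reference is added to the existing entry".
    return list(dict.fromkeys(paragraphs))


original = ["Intro", "See the manual.", "Refer to the manual.", "End"]
# Two different originals received the exact same translation.
translation = ["Intro (fr)", "Voir le manuel.", "Voir le manuel.", "Fin"]

orig_entries = catalog_entries(original)
trans_entries = catalog_entries(translation)
print(len(orig_entries), len(trans_entries))

# Positional pairing now shifts: the third original meets "Fin".
for orig, trans in zip(orig_entries, trans_entries):
    print(repr(orig), "<->", repr(trans))
```

The translation ends up one entry short, so the third original paragraph
is compared with what is really the fourth translated paragraph. Making
the two translations differ slightly restores the one-to-one pairing.
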
SEE ALSO
po4a(1), po4a-normalize(1), po4a-translate(1), po4a-updatepo(1),
po4a(7).

AUTHORS
Denis Barbier <barbier@linuxfr.org>
Nicolas François <nicolas.francois@centraliens.net>
Martin Quinson (mquinson#debian.org)

COPYRIGHT AND LICENSE
Copyright 2002-2020 by SPI, inc.

This program is free software; you may redistribute it and/or modify it
under the terms of GPL (see the COPYING file).

Po4a Tools                       2021-11-01              PO4A-GETTEXTIZE(1p)