Text::Autoformat(3)   User Contributed Perl Documentation   Text::Autoformat(3)

NAME
Text::Autoformat - Automatic text wrapping and reformatting

VERSION
This document describes version 1.13 of Text::Autoformat, released May
4, 2005.

SYNOPSIS
    # Minimal use: read from STDIN, format to STDOUT...

    use Text::Autoformat;
    autoformat;

    # In-memory formatting...

    $formatted = autoformat $rawtext;

    # Configuration...

    $formatted = autoformat $rawtext, { %options };

    # Margins (1..72 by default)...

    $formatted = autoformat $rawtext, { left=>8, right=>70 };

    # Justification (left by default)...

    $formatted = autoformat $rawtext, { justify => 'left' };
    $formatted = autoformat $rawtext, { justify => 'right' };
    $formatted = autoformat $rawtext, { justify => 'full' };
    $formatted = autoformat $rawtext, { justify => 'centre' };

    # Filling (does so by default)...

    $formatted = autoformat $rawtext, { fill=>0 };

    # Squeezing whitespace (does so by default)...

    $formatted = autoformat $rawtext, { squeeze=>0 };

    # Case conversions...

    $formatted = autoformat $rawtext, { case => 'lower' };
    $formatted = autoformat $rawtext, { case => 'upper' };
    $formatted = autoformat $rawtext, { case => 'sentence' };
    $formatted = autoformat $rawtext, { case => 'title' };
    $formatted = autoformat $rawtext, { case => 'highlight' };

    # Selective reformatting

    $formatted = autoformat $rawtext, { ignore=>qr/^\t/ };

BACKGROUND

The problem

Perl plaintext formatters just aren't smart enough. Given a typical
piece of plaintext in need of formatting:

    In comp.lang.perl.misc you wrote:
    : > <CN = Clooless Noobie> writes:
    : > CN> PERL sux because:
    : > CN> * It doesn't have a switch statement and you have to put $
    : > CN>signs in front of everything
    : > CN> * There are too many OR operators: having |, || and 'or'
    : > CN>operators is confusing
    : > CN> * VB rools, yeah!!!!!!!!!
    : > CN> So anyway, how can I stop reloads on a web page?
    : > CN> Email replies only, thanks - I don't read this newsgroup.
    : >
    : > Begone, sirrah! You are a pathetic, Bill-loving, microcephalic
    : > script-infant.
    : Sheesh, what's with this group - ask a question, get toasted! And how
    : *dare* you accuse me of Ianuphilia!

both the venerable Unix fmt tool and Perl's standard Text::Wrap module
produce:

    In comp.lang.perl.misc you wrote: : > <CN = Clooless Noobie>
    writes: : > CN> PERL sux because: : > CN> * It doesn't
    have a switch statement and you have to put $ : > CN>signs in
    front of everything : > CN> * There are too many OR
    operators: having |, || and 'or' : > CN>operators is confusing
    : > CN> * VB rools, yeah!!!!!!!!! : > CN> So anyway, how
    can I stop reloads on a web page? : > CN> Email replies only,
    thanks - I don't read this newsgroup. : > : > Begone, sirrah!
    You are a pathetic, Bill-loving, microcephalic : >
    script-infant. : Sheesh, what's with this group - ask a
    question, get toasted! And how : *dare* you accuse me of
    Ianuphilia!

Other formatting modules -- such as Text::Correct and Text::Format --
provide more control over their output, but produce equally poor
results when applied to arbitrary input. They simply don't understand
the structural conventions of the text they're reformatting.

The solution

The Text::Autoformat module provides a subroutine named "autoformat"
that wraps text to specified margins. However, "autoformat" reformats
its input by analysing the text's structure, so it wraps the above
example like so:

    In comp.lang.perl.misc you wrote:
    : > <CN = Clooless Noobie> writes:
    : > CN> PERL sux because:
    : > CN> * It doesn't have a switch statement and you
    : > CN> have to put $ signs in front of everything
    : > CN> * There are too many OR operators: having |, ||
    : > CN> and 'or' operators is confusing
    : > CN> * VB rools, yeah!!!!!!!!! So anyway, how can I
    : > CN> stop reloads on a web page? Email replies
    : > CN> only, thanks - I don't read this newsgroup.
    : >
    : > Begone, sirrah! You are a pathetic, Bill-loving,
    : > microcephalic script-infant.
    : Sheesh, what's with this group - ask a question, get toasted!
    : And how *dare* you accuse me of Ianuphilia!

Note that the various quoting conventions have been observed. In fact,
their structure has been used to determine where some paragraphs begin.
Furthermore "autoformat" correctly distinguished between the leading
'*' bullets of the nested list (which were outdented) and the leading
emphatic '*' of "*dare*" (which was inlined).

DESCRIPTION

Paragraphs

The fundamental task of the "autoformat" subroutine is to identify and
rearrange independent paragraphs in a text. Paragraphs typically
consist of a series of lines containing at least one non-whitespace
character, followed by one or more lines containing only optional
whitespace. This is a more liberal definition than many other
formatters use: most require an empty line to terminate a paragraph.
Paragraphs may also be denoted by bulleting, numbering, or quoting (see
the following sections).
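
The liberal paragraph definition above can be illustrated with a short
sketch. This is not the module's own code: "split_paragraphs" is a
hypothetical helper, and it only covers the blank-line rule, not
bulleting, numbering or quoting.

```perl
use strict;
use warnings;

# Split text into paragraphs. A separator is any line that is empty or
# holds only spaces and tabs, so "blank-looking" lines also end a paragraph.
sub split_paragraphs {
    my ($text) = @_;
    return grep { /\S/ } split /^[ \t]*\n/m, $text;
}

my $text  = "First line\nsecond line\n   \nNext paragraph\n\n\nLast one\n";
my @paras = split_paragraphs($text);
print scalar(@paras), " paragraphs found\n";
```

A formatter that insisted on truly empty separator lines would treat
the first two paragraphs of this sample as one.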

Once a paragraph has been isolated, "autoformat" fills and re-wraps its
lines according to the margins that are specified in its argument list.
These are placed after the text to be formatted, in a hash reference:

    $tidied = autoformat($messy, {left=>20, right=>60});

By default, "autoformat" uses a left margin of 1 (first column) and a
right margin of 72.

You can also control whether (and how) "autoformat" breaks words at the
end of a line, using the 'break' option:

    # Turn off all hyphenation
    use Text::Autoformat qw(autoformat break_wrap);
    $tidied = autoformat($messy, {break=>break_wrap});

    # Default hyphenation
    use Text::Autoformat qw(autoformat break_at);
    $tidied = autoformat($messy, {break=>break_at('-')});

    # Use TeX::Hyphen module's hyphenation (module must be installed)
    use Text::Autoformat qw(autoformat break_TeX);
    $tidied = autoformat($messy, {break=>break_TeX});

Normally, "autoformat" only reformats the first paragraph it
encounters, and leaves the remainder of the text unaltered. This
behaviour is useful because it allows a one-liner invoking the
subroutine to be mapped onto a convenient keystroke in a text editor,
to provide one-paragraph-at-a-time reformatting:

    % cat .exrc

    map f !Gperl -MText::Autoformat -e'autoformat'

(Note that to facilitate such one-liners, if "autoformat" is called in
a void context without any text data, it takes its text from "STDIN"
and writes its result to "STDOUT").

To enable "autoformat" to rearrange the entire input text at once, the
"all" argument is used:

    $tidied_all = autoformat($messy, {left=>20, right=>60, all=>1});

"autoformat" can also be directed to selectively reformat paragraphs,
using the "ignore" argument:

    $tidied_some = autoformat($messy, {ignore=>qr/^[ \t]/});

The value for "ignore" may be a "qr"'d regex, a subroutine reference,
or the special string 'indented'.

If a regex is specified, any paragraph whose original text matches that
regex will not be reformatted (i.e. it will be printed verbatim).

If a subroutine is specified, that subroutine will be called once for
each paragraph (with $_ set to the paragraph's text). The subroutine is
expected to return a true or false value. If it returns true, the
paragraph will not be reformatted.

If the value of the "ignore" option is the string 'indented',
"autoformat" will ignore any paragraph in which every line begins with
whitespace.
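
The three kinds of "ignore" value can be summarised in a small sketch.
"should_ignore" is a hypothetical helper written for illustration only;
it mirrors the rules described above and is not the module's own
implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Autoformat): decide whether a
# paragraph should be left untouched, given an "ignore" value.
sub should_ignore {
    my ($para, $ignore) = @_;
    return 0 unless defined $ignore;

    if (ref $ignore eq 'Regexp') {
        return $para =~ $ignore ? 1 : 0;
    }
    if (ref $ignore eq 'CODE') {
        local $_ = $para;                 # the callback sees the paragraph in $_
        return $ignore->() ? 1 : 0;
    }
    if ($ignore eq 'indented') {
        my @lines      = split /\n/, $para;
        my @unindented = grep { !/^\s/ } @lines;
        return (@lines && !@unindented) ? 1 : 0;   # every line starts with whitespace
    }
    die "unsupported ignore value\n";
}

print should_ignore("\tcode\n\tmore code\n", qr/^\t/) ? "skip\n" : "reformat\n";
```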

One other special case of ignorance is ignoring mail headers and
signature. This option is specified using the "mail" argument:

    $tidied_mesg = autoformat($messy_mesg, {mail=>1});

Note that the "mail" option automatically implies "all".

Bulleting and (re-)numbering

Often plaintext will include lists that are either:

    * bulleted,
    * simply numbered (i.e. 1., 2., 3., etc.), or
    * hierarchically numbered (1, 1.1, 1.2, 1.3, 2, 2.1. and so forth).

In such lists, each bulleted item is implicitly a separate paragraph,
and is formatted individually, with the appropriate indentation:

    * bulleted,
    * simply numbered (i.e. 1., 2., 3.,
      etc.), or
    * hierarchically numbered (1, 1.1,
      1.2, 1.3, 2, 2.1. and so forth).

More importantly, if the points are numbered, the numbering is checked
and reordered. For example, a list whose points have been rearranged:

    2. Analyze problem
    3. Design algorithm
    1. Code solution
    5. Test
    4. Ship

would be renumbered automatically by "autoformat":

    1. Analyze problem
    2. Design algorithm
    3. Code solution
    4. Test
    5. Ship
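
As a rough sketch of simple renumbering, assuming that the items keep
the order in which they appear and only their labels change,
the hypothetical "renumber" helper below rewrites each leading "N."
in sequence. It handles plain decimal numbering only, and the sample
list is invented for the illustration.

```perl
use strict;
use warnings;

# Replace each leading "N." with the next number in sequence, keeping
# any indentation and leaving the item text untouched.
sub renumber {
    my @lines = @_;
    my $n = 0;
    return map {
        my $line = $_;
        $line =~ s/^(\s*)\d+\./$1 . ++$n . '.'/e;
        $line;
    } @lines;
}

my @fixed = renumber('3. Boil water', '1. Warm the pot', '2. Add tea');
print "$_\n" for @fixed;
```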

The same reordering would be performed if the "numbering" was by
letters ("a.", "b.", "c.", etc.) or Roman numerals ("i.", "ii.",
"iii.") or by some combination of these ("1a.", "1b.", "2a.", "2b.",
etc.). Handling disordered lists of letters and Roman numerals presents
an interesting challenge. A list such as:

    C. Put cat in box.
    D. Close lid.
    E. Activate Geiger counter.

should be reordered as "A.", "B.", "C.", whereas:

    C. Put cat in box.
    D. Close lid.
    XLI. Activate Geiger counter.

should be reordered "I.", "II.", "III."

The "autoformat" subroutine solves this problem by always interpreting
alphabetic bullets as being letters, unless the full list consists only
of valid Roman numerals, at least one of which is two or more
characters long.
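
The letters-versus-Roman-numerals rule can be sketched as follows. Both
subroutines are hypothetical, and "valid Roman numeral" is assumed here
to mean a well-formed numeral in the range 1 to 3999; the module may
apply a different test.

```perl
use strict;
use warnings;

# True if $s is a well-formed Roman numeral (case-insensitive).
sub is_roman {
    my ($s) = @_;
    return length($s)
        && $s =~ /\A M{0,3} (?:CM|CD|D?C{0,3}) (?:XC|XL|L?X{0,3}) (?:IX|IV|V?I{0,3}) \z/xi;
}

# Letters by default; Roman only if every bullet is a valid numeral
# and at least one bullet is two or more characters long.
sub bullet_style {
    my @bullets   = @_;
    my $all_roman = !grep { !is_roman($_) } @bullets;
    my $has_long  = grep { length($_) >= 2 } @bullets;
    return ($all_roman && $has_long) ? 'roman' : 'letter';
}

print bullet_style(qw(C D E)),   "\n";
print bullet_style(qw(C D XLI)), "\n";
```

Note that "C" and "D" on their own are valid numerals (100 and 500),
which is why the length condition is needed to break the tie.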

If automatic renumbering isn't wanted, just specify the 'renumber'
option with a false value.

Note that numbers above 1000 at the start of a line are no longer
considered to be paragraph numbering. Numbered paragraphs running that
high are exceptionally rare, and much rarer than paragraphs that look
like this:

    Although it has long been popular (especially in the year
    2001) to point out that we now live in the Future, many
    of the promised miracles of Future Life have failed to
    eventuate. This is a new phenomenon (it didn't happen in
    1001) because the idea that the future might be different
    is a new phenomenon.

which the former numbering rules caused to be formatted like this:

    Although it has long been popular (especially in the year

    2001) to point out that we now live in the Future, many of the
    promised miracles of Future Life have failed to eventuate.
    This is a new phenomenon (it didn't happen in

    2002) because the idea that the future might be different is a
    new phenomenon.

but which are now formatted:

    Although it has long been popular (especially in the year 2001)
    to point out that we now live in the Future, many of the
    promised miracles of Future Life have failed to eventuate. This
    is a new phenomenon (it didn't happen in 1001) because the idea
    that the future might be different is a new phenomenon.

If you want numbers less than 1000 (or other character strings
currently treated as bullets) to be ignored in this way, you can turn
off list formatting entirely by setting the 'lists' option to a false
value.

Quoting

Another case in which contiguous lines may be interpreted as belonging
to different paragraphs is where they are quoted with distinct
quoters. For example:

    : > CN> So anyway, how can I stop reloads on a web page? Email
    : > CN> replies only, thanks - I don't read this newsgroup.
    : > Begone, sirrah! You are a pathetic, Bill-loving,
    : > microcephalic script-infant.
    : Sheesh, what's with this group - ask a question, get toasted!
    : And how *dare* you accuse me of Ianuphilia!

"autoformat" recognizes the various quoting conventions used in this
example and treats it as three paragraphs to be independently
reformatted.
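
A minimal sketch of splitting by quoter follows. The definition of a
quoter used here (a run of ':' or '>' markers, or initials followed by
'>') is an assumption made for the illustration, and "quoter_of" and
"group_by_quoter" are hypothetical names; the module recognizes a wider
range of conventions.

```perl
use strict;
use warnings;

# Extract the leading quoter of a line, with spacing normalised so that
# ": >" and ":>" compare equal.
sub quoter_of {
    my ($line) = @_;
    my ($q) = $line =~ /\A((?:\s*(?:[:>]|[A-Za-z]+>))*)/;
    $q =~ s/\s+//g;
    return $q;
}

# Group contiguous lines that share the same quoter.
sub group_by_quoter {
    my (@groups, $prev);
    for my $line (@_) {
        my $q = quoter_of($line);
        if (!defined $prev || $q ne $prev) {
            push @groups, { quoter => $q, lines => [] };
            $prev = $q;
        }
        push @{ $groups[-1]{lines} }, $line;
    }
    return @groups;
}

my @groups = group_by_quoter(
    ": > CN> So anyway, how can I stop reloads on a web page? Email",
    ": > CN> replies only, thanks - I don't read this newsgroup.",
    ": > Begone, sirrah! You are a pathetic, Bill-loving,",
    ": > microcephalic script-infant.",
    ": Sheesh, what's with this group - ask a question, get toasted!",
    ": And how *dare* you accuse me of Ianuphilia!",
);
printf "%-6s %d lines\n", $_->{quoter}, scalar @{ $_->{lines} } for @groups;
```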

Block quotations present a different challenge. A typical formatter
would render the following quotation:

    "We are all of us in the gutter, but some of us are looking at
    the stars"
                    -- Oscar Wilde

like so:

    "We are all of us in the gutter, but some of us are looking at
    the stars" -- Oscar Wilde

"autoformat" recognizes the quotation structure by matching the
following regular expression against the text component of each
paragraph:

    / \A(\s*)               # leading whitespace for quotation
      (["']|``)             # opening quotemark
      (.*)                  # quotation
      (''|\2)               # closing quotemark
      \s*?\n                # trailing whitespace after quotation
      (\1[ ]+)              # leading whitespace for attribution
                            #   (must be indented more than
                            #   quotation)
      (--|-)                # attribution introducer
      ([^\n]*?\n)           # first attribution line
      ((\5[^\n]*?$)*)       # other attribution lines
                            #   (indented no less than first line)
      \s*\Z                 # optional whitespace to end of paragraph
    /xsm

When reformatted (see below), the indentation and the attribution
structure will be preserved:

    "We are all of us in the gutter, but some of us are looking at
    the stars"
                    -- Oscar Wilde
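
The expression above can be tried against the Oscar Wilde paragraph
directly. The sketch below only shows which parts of the paragraph the
captures pick out; the variable names and the four-space attribution
indent are choices made for this illustration.

```perl
use strict;
use warnings;

my $quotation = qr/
    \A(\s*)            # leading whitespace for quotation
    (["']|``)          # opening quotemark
    (.*)               # quotation
    (''|\2)            # closing quotemark
    \s*?\n             # trailing whitespace after quotation
    (\1[ ]+)           # leading whitespace for attribution
    (--|-)             # attribution introducer
    ([^\n]*?\n)        # first attribution line
    ((\5[^\n]*?$)*)    # other attribution lines
    \s*\Z              # optional whitespace to end of paragraph
/xsm;

my $para = qq{"We are all of us in the gutter, but some of us are looking at\n}
         . qq{the stars"\n}
         . qq{    -- Oscar Wilde\n};

if ($para =~ $quotation) {
    my ($quote, $indent, $attribution) = ($3, $5, $7);
    print "quotation:   $quote\n";
    print "attribution: $attribution";
    print "indent:      ", length($indent), " columns\n";
}
```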

Widow control

Note that in the last example, "autoformat" broke the line at column
68, four characters earlier than it should have. It did so because, if
the full margin width had been used, the formatting would have left the
last two words by themselves on an oddly short last line:

    "We are all of us in the gutter, but some of us are looking at
    the stars"

This phenomenon is known as "widowing" and is heavily frowned upon in
typesetting circles. It looks ugly in plaintext too, so "autoformat"
avoids it by stealing extra words from earlier lines in a paragraph, so
as to leave enough for a reasonable last line. The heuristic used is
that final lines must be at least 10 characters long (though this
number may be adjusted by passing a "widow => minlength" argument to
"autoformat").

If the last line is too short, the paragraph's right margin is reduced
by one column, and the paragraph is reformatted. This process iterates
until either the last line exceeds nine characters or the margins have
been narrowed by 10% of their original separation. In the latter case,
the reformatter gives up and uses its original formatting.
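
The narrowing loop described above might look roughly like this. It is
a sketch under simplifying assumptions: "wrap_words" is a plain greedy
wrapper standing in for the real formatter, widths are measured from
column 1, and both subroutine names are hypothetical.

```perl
use strict;
use warnings;

# Greedy word wrap to a given width (a stand-in for the real formatter).
sub wrap_words {
    my ($text, $width) = @_;
    my @lines = ('');
    for my $word (split ' ', $text) {
        if ($lines[-1] eq '') {
            $lines[-1] = $word;
        }
        elsif (length($lines[-1]) + 1 + length($word) <= $width) {
            $lines[-1] .= " $word";
        }
        else {
            push @lines, $word;
        }
    }
    return @lines;
}

# Narrow the width one column at a time until the last line is long
# enough, giving up once the width has shrunk by 10%.
sub wrap_without_widow {
    my ($text, $width, $widow) = @_;
    $widow = 10 unless defined $widow;
    my @original = wrap_words($text, $width);
    my $limit    = $width - int($width * 0.10);
    for (my $w = $width; $w >= $limit; $w--) {
        my @lines = wrap_words($text, $w);
        return @lines if @lines < 2 || length($lines[-1]) >= $widow;
    }
    return @original;    # no acceptable width found: keep the first attempt
}

my $text = 'We are all of us in the gutter, but some of us are looking at the stars';
print "$_\n" for wrap_without_widow($text, 64);
```

At width 64 the greedy wrap leaves "the stars" (nine characters) alone
on the last line; narrowing to 60 columns pulls "at" down to join it.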

Justification

The "autoformat" subroutine also takes a named argument: "{justify =>
type}", which specifies how each paragraph is to be justified. The
options are: 'left' (the default), 'right', 'centre' (or 'center'),
and 'full'. These act on the complete paragraph text (but not on any
quoters before that text). For example, with 'right' justification:

    R3> Now is the Winter of our discontent made
    R4> glorious Summer by this son of York. And all
    R5> the clouds that lour'd upon our house In the
    R6> deep bosom of the ocean buried.

Full justification is interesting in a fixed-width medium like
plaintext because it usually results in uneven spacing between words.
Typically, formatters provide this by distributing the extra spaces
into the first available gaps of each line:

    R7> Now is the Winter of our discontent made
    R8> glorious Summer by this son of York. And all
    R9> the clouds that lour'd upon our house In
    R10> the deep bosom of the ocean buried.

This produces a rather jarring visual effect, so "autoformat" reverses
the strategy and inserts extra spaces at the end of lines:

    R11> Now is the Winter of our discontent made
    R12> glorious Summer by this son of York. And all
    R13> the clouds that lour'd upon our house In
    R14> the deep bosom of the ocean buried.

Most readers find this less disconcerting.

Implicit centring

Even if explicit centring is not specified, "autoformat" will attempt
to automatically detect centred paragraphs and preserve their
justification. It does this by examining each line of the paragraph and
asking: "if this line were part of a centred paragraph, where would the
centre line have been?"

The answer can be determined by adding the length of leading whitespace
before the first word, plus half the length of the full set of words on
the line. That is, for a single line:

    $line =~ /^(\s*)(.*?)(\s*)$/;
    $centre = length($1)+0.5*length($2);

By making the same estimate for every line, and then comparing the
estimates, it is possible to deduce whether all the lines are centred
with respect to the same axis of symmetry (with an allowance of ±1 to
cater for the inevitable rounding when the centre positions of
even-length rows were originally computed). If a common axis of
symmetry is detected, "autoformat" assumes that the lines are supposed
to be centred, and switches to centre-justification mode for that
paragraph.
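
The comparison of per-line estimates can be sketched like this. The
subroutine names are hypothetical, and two choices are assumptions of
the sketch rather than documented behaviour: the tolerance is read as
"all estimates lie within 1 of some common axis" (so the spread may be
at most 2), and a paragraph needs at least two lines to qualify.

```perl
use strict;
use warnings;

# Estimated centre column of one line, using the formula above.
sub centre_of {
    my ($line) = @_;
    $line =~ /^(\s*)(.*?)(\s*)$/;
    return length($1) + 0.5 * length($2);
}

# True if every line's estimated centre lies within 1 of a common axis.
sub looks_centred {
    my @centres = map { centre_of($_) } @_;
    return 0 if @centres < 2;
    my ($min, $max) = (sort { $a <=> $b } @centres)[0, -1];
    return ($max - $min <= 2) ? 1 : 0;
}

my @title = ('      A Tale', '   of Two Cities', '     by Dickens');
print looks_centred(@title) ? "centred\n" : "not centred\n";
```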

Note that this behaviour can be switched off entirely by setting the
"autocentre" argument false.

Case transformations

The "autoformat" subroutine can also optionally perform case
conversions on the text it processes. The "{case => type}" argument
allows the user to specify five different conversions:

'upper'
    This mode unconditionally converts every letter in the reformatted
    text to upper-case;

'lower'
    This mode unconditionally converts every letter in the reformatted
    text to lower-case;

'sentence'
    This mode attempts to generate correctly-cased sentences from the
    input text. That is, the first letter after a sentence-terminating
    punctuator is converted to upper-case. Then, each subsequent word
    in the sentence is converted to lower-case, unless that word is
    originally mixed-case or contains punctuation. For example, under
    "{case => 'sentence'}":

        'POVERTY, MISERY, ETC. are the lot of the PhD candidate. alas!'

    becomes:

        'Poverty, misery, etc. are the lot of the PhD candidate. Alas!'

    Note that "autoformat" is clever enough to recognize that the
    period after abbreviations such as "etc." is not a sentence
    terminator.

    If the argument is specified as 'sentence ' (with one or more
    trailing whitespace characters) those characters are used to
    replace the single space that appears at the end of the sentence.
    For example, "autoformat($text, {case=>'sentence '})" would
    produce:

        'Poverty, misery, etc. are the lot of the PhD candidate. Alas!'

'title'
    This mode behaves like 'sentence' except that the first letter of
    every word is capitalized:

        'What I Did On My Summer Vacation In Monterey'

'highlight'
    This mode behaves like 'title' except that trivial words are not
    capitalized:

        'What I Did on my Summer Vacation in Monterey'
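
The difference between 'title' and 'highlight' can be shown with a
simplified sketch. The list of "trivial" words is an assumption made
for this example (the module's own list is not given here), both
subroutines are hypothetical, and unlike the real modes they lower-case
every word first instead of preserving mixed-case words.

```perl
use strict;
use warnings;

# Assumed set of "trivial" words for this sketch only.
my %trivial = map { $_ => 1 }
    qw(a an and as at but by for from in my of on or the to with);

# Capitalize the first letter of every word.
sub title_case {
    my ($text) = @_;
    return join ' ', map { ucfirst(lc($_)) } split ' ', $text;
}

# As title_case, but leave trivial words in lower-case (except the first word).
sub highlight_case {
    my ($text) = @_;
    my @words = split ' ', $text;
    my $first = 1;
    for my $word (@words) {
        my $lower = lc $word;
        $word  = ($first || !$trivial{$lower}) ? ucfirst($lower) : $lower;
        $first = 0;
    }
    return join ' ', @words;
}

my $raw = 'what i did on my summer vacation in monterey';
print title_case($raw),     "\n";
print highlight_case($raw), "\n";
```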

Selective reformatting

You can select which paragraphs "autoformat" actually reformats (or,
rather, those it doesn't reformat) using the "ignore" flag.

For example:

    # Reformat all paras except those containing "verbatim"...
    print autoformat { all => 1, ignore => qr/verbatim/i }, $text;

    # Reformat all paras except those less than 3 lines long...
    print autoformat { all => 1, ignore => sub { tr/\n/\n/ < 3 } }, $text;

    # Reformat all paras except those that are indented...
    print autoformat { all => 1, ignore => qr/^\s/m }, $text;

    # Reformat all paras except those that are indented (easier)...
    print autoformat { all => 1, ignore => 'indented' }, $text;

SEE ALSO
The Text::Reform module

AUTHOR
Damian Conway (damian@conway.org)

BUGS
There are undoubtedly serious bugs lurking somewhere in code this funky
:-) Bug reports and other feedback are most welcome.

COPYRIGHT
Copyright (c) 1997-2000, Damian Conway. All Rights Reserved. This
module is free software. It may be used, redistributed and/or modified
under the terms of the Perl Artistic License (see
http://www.perl.com/perl/misc/Artistic.html)

perl v5.8.8                     2005-05-04                Text::Autoformat(3)