Text::Autoformat(3)   User Contributed Perl Documentation  Text::Autoformat(3)

NAME

       Text::Autoformat - Automatic text wrapping and reformatting

VERSION

       This document describes version 1.13 of Text::Autoformat, released May
       4, 2005.

SYNOPSIS

        # Minimal use: read from STDIN, format to STDOUT...

               use Text::Autoformat;
               autoformat;

        # In-memory formatting...

               $formatted = autoformat $rawtext;

        # Configuration...

               $formatted = autoformat $rawtext, { %options };

        # Margins (1..72 by default)...

               $formatted = autoformat $rawtext, { left=>8, right=>70 };

        # Justification (left by default)...

               $formatted = autoformat $rawtext, { justify => 'left' };
               $formatted = autoformat $rawtext, { justify => 'right' };
               $formatted = autoformat $rawtext, { justify => 'full' };
               $formatted = autoformat $rawtext, { justify => 'centre' };

        # Filling (does so by default)...

               $formatted = autoformat $rawtext, { fill=>0 };

        # Squeezing whitespace (does so by default)...

               $formatted = autoformat $rawtext, { squeeze=>0 };

        # Case conversions...

               $formatted = autoformat $rawtext, { case => 'lower' };
               $formatted = autoformat $rawtext, { case => 'upper' };
               $formatted = autoformat $rawtext, { case => 'sentence' };
               $formatted = autoformat $rawtext, { case => 'title' };
               $formatted = autoformat $rawtext, { case => 'highlight' };

        # Selective reformatting

               $formatted = autoformat $rawtext, { ignore=>qr/^\t/ };

BACKGROUND

       The problem

       Perl plaintext formatters just aren't smart enough. Given a typical
       piece of plaintext in need of formatting:

               In comp.lang.perl.misc you wrote:
               : > <CN = Clooless Noobie> writes:
               : > CN> PERL sux because:
               : > CN>    * It doesn't have a switch statement and you have to put $
               : > CN>signs in front of everything
               : > CN>    * There are too many OR operators: having |, || and 'or'
               : > CN>operators is confusing
               : > CN>    * VB rools, yeah!!!!!!!!!
               : > CN> So anyway, how can I stop reloads on a web page?
               : > CN> Email replies only, thanks - I don't read this newsgroup.
               : >
               : > Begone, sirrah! You are a pathetic, Bill-loving, microcephalic
               : > script-infant.
               : Sheesh, what's with this group - ask a question, get toasted! And how
               : *dare* you accuse me of Ianuphilia!

       both the venerable Unix fmt tool and Perl's standard Text::Wrap module
       produce:

               In comp.lang.perl.misc you wrote:  : > <CN = Clooless Noobie>
               writes:  : > CN> PERL sux because:  : > CN>    * It doesn't
               have a switch statement and you have to put $ : > CN>signs in
               front of everything : > CN>    * There are too many OR
               operators: having |, || and 'or' : > CN>operators is confusing
               : > CN>    * VB rools, yeah!!!!!!!!!  : > CN> So anyway, how
               can I stop reloads on a web page?  : > CN> Email replies only,
               thanks - I don't read this newsgroup.  : > : > Begone, sirrah!
               You are a pathetic, Bill-loving, microcephalic : >
               script-infant.  : Sheesh, what's with this group - ask a
               question, get toasted! And how : *dare* you accuse me of
               Ianuphilia!

       Other formatting modules -- such as Text::Correct and Text::Format --
       provide more control over their output, but produce equally poor
       results when applied to arbitrary input. They simply don't understand
       the structural conventions of the text they're reformatting.

       The solution

       The Text::Autoformat module provides a subroutine named "autoformat"
       that wraps text to specified margins. However, "autoformat" reformats
       its input by analysing the text's structure, so it wraps the above
       example like so:

               In comp.lang.perl.misc you wrote:
               : > <CN = Clooless Noobie> writes:
               : > CN> PERL sux because:
               : > CN>    * It doesn't have a switch statement and you
               : > CN>      have to put $ signs in front of everything
               : > CN>    * There are too many OR operators: having |, ||
               : > CN>      and 'or' operators is confusing
               : > CN>    * VB rools, yeah!!!!!!!!! So anyway, how can I
               : > CN>      stop reloads on a web page? Email replies
               : > CN>      only, thanks - I don't read this newsgroup.
               : >
               : > Begone, sirrah! You are a pathetic, Bill-loving,
               : > microcephalic script-infant.
               : Sheesh, what's with this group - ask a question, get toasted!
               : And how *dare* you accuse me of Ianuphilia!

       Note that the various quoting conventions have been observed. In fact,
       their structure has been used to determine where some paragraphs begin.
       Furthermore "autoformat" correctly distinguished between the leading
       '*' bullets of the nested list (which were outdented) and the leading
       emphatic '*' of "*dare*" (which was inlined).

DESCRIPTION

       Paragraphs

       The fundamental task of the "autoformat" subroutine is to identify and
       rearrange independent paragraphs in a text. Paragraphs typically consist
       of a series of lines containing at least one non-whitespace character,
       followed by one or more lines containing only optional whitespace. This
       is a more liberal definition than many other formatters use: most
       require an empty line to terminate a paragraph. Paragraphs may also be
       denoted by bulleting, numbering, or quoting (see the following
       sections).

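       As a rough illustration of that paragraph definition (this is not the
       module's own code), the sketch below splits text wherever one or more
       whitespace-only lines occur. The helper name "split_paragraphs" is
       hypothetical, and bullets, numbering and quoting are ignored here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Autoformat: split text into
# paragraphs wherever one or more whitespace-only lines occur.
sub split_paragraphs {
    my ($text) = @_;
    my @chunks = split /(?:^[^\S\n]*\n)+/m, $text;
    return grep { /\S/ } @chunks;    # drop an empty leading chunk, if any
}

my $text  = "first line\nsecond line\n   \t\nthird line\n\n\nfourth line\n";
my @paras = split_paragraphs($text);
print scalar(@paras), " paragraphs found\n";
```

       Note that the line holding only spaces and a tab separates paragraphs
       just as a truly empty line does.
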
       Once a paragraph has been isolated, "autoformat" fills and re-wraps its
       lines according to the margins that are specified in its argument list.
       These are placed after the text to be formatted, in a hash reference:

               $tidied = autoformat($messy, {left=>20, right=>60});

       By default, "autoformat" uses a left margin of 1 (first column) and a
       right margin of 72.

       You can also control whether (and how) "autoformat" breaks words at the
       end of a line, using the 'break' option:

               # Turn off all hyphenation
               use Text::Autoformat qw(autoformat break_wrap);
               $tidied = autoformat($messy, {break=>break_wrap});

               # Default hyphenation
               use Text::Autoformat qw(autoformat break_at);
               $tidied = autoformat($messy, {break=>break_at('-')});

               # Use TeX::Hyphen module's hyphenation (module must be installed)
               use Text::Autoformat qw(autoformat break_TeX);
               $tidied = autoformat($messy, {break=>break_TeX});

       Normally, "autoformat" only reformats the first paragraph it
       encounters, and leaves the remainder of the text unaltered. This
       behaviour is useful because it allows a one-liner invoking the
       subroutine to be mapped onto a convenient keystroke in a text editor,
       to provide one-paragraph-at-a-time reformatting:

               % cat .exrc

               map f !Gperl -MText::Autoformat -e'autoformat'

       (Note that to facilitate such one-liners, if "autoformat" is called in
       a void context without any text data, it takes its text from "STDIN"
       and writes its result to "STDOUT".)

       To enable "autoformat" to rearrange the entire input text at once, the
       "all" argument is used:

               $tidied_all = autoformat($messy, {left=>20, right=>60, all=>1});

       "autoformat" can also be directed to selectively reformat paragraphs,
       using the "ignore" argument:

               $tidied_some = autoformat($messy, {ignore=>qr/^[ \t]/});

       The value for "ignore" may be a "qr"'d regex, a subroutine reference,
       or the special string 'indented'.

       If a regex is specified, any paragraph whose original text matches that
       regex will not be reformatted (i.e. it will be printed verbatim).

       If a subroutine is specified, that subroutine will be called once for
       each paragraph (with $_ set to the paragraph's text). The subroutine is
       expected to return a true or false value. If it returns true, the
       paragraph will not be reformatted.

       If the value of the "ignore" option is the string 'indented',
       "autoformat" will ignore any paragraph in which every line begins with
       whitespace.

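       The three forms of the "ignore" value can be pictured as a small
       dispatch. The following is only a sketch of that decision, written
       from the description above; "should_ignore" is a hypothetical helper,
       not a function exported by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether one paragraph should be left alone,
# following the three documented forms of the 'ignore' value.
sub should_ignore {
    my ($ignore, $para) = @_;
    return 0 unless defined $ignore;
    if (ref $ignore eq 'Regexp') {
        return $para =~ $ignore ? 1 : 0;
    }
    if (ref $ignore eq 'CODE') {
        local $_ = $para;            # the callback sees the paragraph in $_
        return $ignore->() ? 1 : 0;
    }
    if ($ignore eq 'indented') {
        # ignore only if every line begins with whitespace
        my @lines = split /\n/, $para;
        return (grep { !/^\s/ } @lines) ? 0 : 1;
    }
    die "unsupported 'ignore' value";
}

print should_ignore(qr/^\t/, "\tverbatim line\n") ? "skip\n" : "reformat\n";
print should_ignore('indented', "  a\nb\n")       ? "skip\n" : "reformat\n";
```
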
       One other special case of ignorance is ignoring mail headers and
       signatures. This option is specified using the "mail" argument:

               $tidied_mesg = autoformat($messy_mesg, {mail=>1});

       Note that the "mail" option automatically implies "all".

       Bulleting and (re-)numbering

       Often plaintext will include lists that are either:

               * bulleted,
               * simply numbered (i.e. 1., 2., 3., etc.), or
               * hierarchically numbered (1, 1.1, 1.2, 1.3, 2, 2.1. and so forth).

       In such lists, each bulleted item is implicitly a separate paragraph,
       and is formatted individually, with the appropriate indentation:

               * bulleted,
               * simply numbered (i.e. 1., 2., 3.,
                 etc.), or
               * hierarchically numbered (1, 1.1,
                 1.2, 1.3, 2, 2.1. and so forth).

       More importantly, if the points are numbered, the numbering is checked
       and corrected. The items keep their order; only their numbers change.
       For example, a list whose points have been rearranged:

               2. Analyze problem
               3. Design algorithm
               1. Code solution
               5. Test
               4. Ship

       would be renumbered automatically by "autoformat":

               1. Analyze problem
               2. Design algorithm
               3. Code solution
               4. Test
               5. Ship

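       The effect on a simple numeric list can be reproduced in a few lines
       of plain Perl. This is a minimal sketch of the idea only (the
       hypothetical "renumber" helper handles just the "N." style and none of
       the hierarchical, lettered or Roman forms):

```perl
use strict;
use warnings;

# Hypothetical helper: renumber a simple "N." list, keeping the items in
# their existing order and replacing only the leading numbers.
sub renumber {
    my @lines = @_;
    my $n = 0;
    for my $line (@lines) {
        $line =~ s/^(\s*)\d+\./$1 . ++$n . '.'/e;
    }
    return @lines;
}

my @fixed = renumber(
    '2. Analyze problem',
    '3. Design algorithm',
    '1. Code solution',
    '5. Test',
    '4. Ship',
);
print "$_\n" for @fixed;
```
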
       The same renumbering would be performed if the "numbering" was by
       letters ("a." "b." "c." etc.) or Roman numerals ("i." "ii." "iii."
       etc.) or by some combination of these ("1a." "1b." "2a." "2b." etc.).
       Handling disordered lists of letters and Roman numerals presents an
       interesting challenge. A list such as:

               C. Put cat in box.
               D. Close lid.
               E. Activate Geiger counter.

       should be renumbered as "A." "B." "C.", whereas:

               C. Put cat in box.
               D. Close lid.
               XLI. Activate Geiger counter.

       should be renumbered "I." "II." "III."

       The "autoformat" subroutine solves this problem by always interpreting
       alphabetic bullets as being letters, unless the full list consists only
       of valid Roman numerals, at least one of which is two or more
       characters long.

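       That rule is compact enough to sketch directly. The code below is an
       illustration under stated assumptions, not the module's
       implementation: "bullet_style" is a hypothetical helper, and the
       Roman-numeral pattern is one common way to test validity (values up
       to 3999).

```perl
use strict;
use warnings;

# Matches a well-formed Roman numeral (either case) from 1 to 3999.
my $roman = qr/^(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})$/i;

# Hypothetical helper: classify a list's alphabetic bullets as 'roman' only
# if every bullet is a valid Roman numeral and at least one of them is two
# or more characters long; otherwise treat them as letters.
sub bullet_style {
    my @bullets   = @_;
    my $all_roman = !grep { $_ !~ $roman } @bullets;
    my $has_long  = grep { length($_) >= 2 } @bullets;
    return ($all_roman && $has_long) ? 'roman' : 'letter';
}

print join(' ', qw(C D E)),   ' => ', bullet_style(qw(C D E)),   "\n";
print join(' ', qw(C D XLI)), ' => ', bullet_style(qw(C D XLI)), "\n";
```
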
       If automatic renumbering isn't wanted, just specify the 'renumber'
       option with a false value.

       Note that numbers above 1000 at the start of a line are no longer
       considered to be paragraph numbering. Numbered paragraphs running that
       high are exceptionally rare, and much rarer than paragraphs that look
       like this:

               Although it has long been popular (especially in the year
               2001) to point out that we now live in the Future, many
               of the promised miracles of Future Life have failed to
               eventuate. This is a new phenomenon (it didn't happen in
               1001) because the idea that the future might be different
               is a new phenomenon.

       which the former numbering rules caused to be formatted like this:

               Although it has long been popular (especially in the year

               2001) to point out that we now live in the Future, many of the
                     promised miracles of Future Life have failed to eventuate.
                     This is a new phenomenon (it didn't happen in

               2002) because the idea that the future might be different is a
                     new phenomenon.

       but which are now formatted:

               Although it has long been popular (especially in the year 2001)
               to point out that we now live in the Future, many of the
               promised miracles of Future Life have failed to eventuate. This
               is a new phenomenon (it didn't happen in 1001) because the idea
               that the future might be different is a new phenomenon.

       If you want numbers less than 1000 (or other character strings
       currently treated as bullets) to be ignored in this way, you can turn
       off list formatting entirely by setting the 'lists' option to a false
       value.

       Quoting

       Another case in which contiguous lines may be interpreted as belonging
       to different paragraphs is where they are quoted with distinct
       quoters. For example:

               : > CN> So anyway, how can I stop reloads on a web page? Email
               : > CN> replies only, thanks - I don't read this newsgroup.
               : > Begone, sirrah! You are a pathetic, Bill-loving,
               : > microcephalic script-infant.
               : Sheesh, what's with this group - ask a question, get toasted!
               : And how *dare* you accuse me of Ianuphilia!

       "autoformat" recognizes the various quoting conventions used in this
       example and treats it as three paragraphs to be independently
       reformatted.

       Block quotations present a different challenge. A typical formatter
       would render the following quotation:

               "We are all of us in the gutter, but some of us are looking at
                the stars"
                                       -- Oscar Wilde

       like so:

               "We are all of us in the gutter, but some of us are looking at
               the stars" -- Oscar Wilde

       "autoformat" recognizes the quotation structure by matching the
       following regular expression against the text component of each
       paragraph:

               / \A(\s*)               # leading whitespace for quotation
                 (["']|``)             # opening quotemark
                 (.*)                  # quotation
                 (''|\2)               # closing quotemark
                 \s*?\n                # trailing whitespace after quotation
                 (\1[ ]+)              # leading whitespace for attribution
                                       #   (must be indented more than
                                       #   quotation)
                 (--|-)                # attribution introducer
                 ([^\n]*?\n)           # first attribution line
                 ((\5[^\n]*?$)*)       # other attribution lines
                                       #   (indented no less than first line)
                 \s*\Z                 # optional whitespace to end of paragraph
               /xsm

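       To see what the pattern captures, it can be tried on the Oscar Wilde
       paragraph in isolation. The snippet below is a standalone experiment
       with the regular expression quoted above; the variable names are
       illustrative and nothing here calls into the module.

```perl
use strict;
use warnings;

my $quotation = qr/ \A(\s*)   # leading whitespace for quotation
    (["']|``)                 # opening quotemark
    (.*)                      # quotation
    (''|\2)                   # closing quotemark
    \s*?\n                    # trailing whitespace after quotation
    (\1[ ]+)                  # leading whitespace for attribution
    (--|-)                    # attribution introducer
    ([^\n]*?\n)               # first attribution line
    ((\5[^\n]*?$)*)           # other attribution lines
    \s*\Z                     # optional whitespace to end of paragraph
/xsm;

my $para = <<'END';
"We are all of us in the gutter, but some of us are looking at
 the stars"
                        -- Oscar Wilde
END

# In list context the match returns the nine capture groups in order.
my @parts = $para =~ $quotation;
if (@parts) {
    my ($quote, $intro, $who) = @parts[2, 5, 6];
    $who =~ s/^\s+|\s+$//g;
    print "quotation:   $quote\n";
    print "attribution: $intro $who\n";
}
```

       The flattened one-line form produced by a typical formatter does not
       match, because the pattern requires the attribution to start on its
       own, more deeply indented, line.
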
       When reformatted (see below), the indentation and the attribution
       structure will be preserved:

               "We are all of us in the gutter, but some of us are looking at
                the stars"
                                       -- Oscar Wilde

       Widow control

       Note that in the last example, "autoformat" broke the line at column
       68, four characters earlier than it should have. It did so because, if
       the full margin width had been used, the formatting would have left the
       last two words by themselves on an oddly short last line:

               "We are all of us in the gutter, but some of us are looking at
               the stars"

       This phenomenon is known as "widowing" and is heavily frowned upon in
       typesetting circles. It looks ugly in plaintext too, so "autoformat"
       avoids it by stealing extra words from earlier lines in a paragraph, so
       as to leave enough for a reasonable last line. The heuristic used is
       that final lines must be at least 10 characters long (though this
       number may be adjusted by passing a "widow => minlength" argument to
       "autoformat").

       If the last line is too short, the paragraph's right margin is reduced
       by one column, and the paragraph is reformatted. This process iterates
       until either the last line exceeds nine characters or the margins have
       been narrowed by 10% of their original separation. In the latter case,
       the reformatter gives up and uses its original formatting.

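       The narrowing loop can be sketched as follows. This is a simplified
       model, not the module's code: it uses a plain greedy word wrap, treats
       the margin separation as a single width, and the helper names
       "wrap_words" and "wrap_without_widow" are hypothetical.

```perl
use strict;
use warnings;

# Greedy word wrap to a given width (a stand-in for the real formatter).
sub wrap_words {
    my ($width, @words) = @_;
    return () unless @words;
    my @lines = (shift @words);
    for my $word (@words) {
        if (length($lines[-1]) + 1 + length($word) <= $width) {
            $lines[-1] .= " $word";
        }
        else {
            push @lines, $word;
        }
    }
    return @lines;
}

# Hypothetical helper: narrow the width one column at a time until the last
# line is at least $min_last characters long, giving up after 10%.
sub wrap_without_widow {
    my ($text, $width, $min_last) = @_;
    my @words     = split ' ', $text;
    my @original  = wrap_words($width, @words);
    my $narrowest = $width - int($width * 0.10);
    for (my $w = $width; $w >= $narrowest; $w--) {
        my @lines = wrap_words($w, @words);
        return @lines if @lines && length($lines[-1]) >= $min_last;
    }
    return @original;    # give up and keep the original formatting
}

my @result = wrap_without_widow('aaaa bbbb cccc dd', 14, 5);
print "$_\n" for @result;
```

       At width 14 the last line would be the lone "dd"; narrowing to 13
       pushes "cccc" down to join it.
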
       Justification

       The "autoformat" subroutine also takes a named argument: "{justify =>
       type}", which specifies how each paragraph is to be justified. The
       options are: 'left' (the default), 'right', 'centre' (or 'center'),
       and 'full'. These act on the complete paragraph text (but not on any
       quoters before that text). For example, with 'right' justification:

               R3>     Now is the Winter of our discontent made
               R3> glorious Summer by this son of York. And all
               R3> the clouds that lour'd upon our house In the
               R3>              deep bosom of the ocean buried.

       Full justification is interesting in a fixed-width medium like
       plaintext because it usually results in uneven spacing between words.
       Typically, formatters provide this by distributing the extra spaces
       into the first available gaps of each line:

               R3> Now  is  the  Winter  of our discontent made
               R3> glorious Summer by this son of York. And all
               R3> the  clouds  that  lour'd  upon our house In
               R3> the deep bosom of the ocean buried.

       This produces a rather jarring visual effect, so "autoformat" reverses
       the strategy and inserts extra spaces at the end of lines:

               R3> Now is the Winter  of  our  discontent  made
               R3> glorious Summer by this son of York. And all
               R3> the clouds that lour'd  upon  our  house  In
               R3> the deep bosom of the ocean buried.

       Most readers find this less disconcerting.

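       One way to picture the "extra spaces at the end" strategy is the
       sketch below, which pads a single line to a target width by widening
       the gaps from the right. It is an illustration of the idea only;
       "justify_full" is a hypothetical helper and the module's own
       distribution rule may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: pad one line to $width by adding the extra spaces
# to the rightmost gaps first.
sub justify_full {
    my ($line, $width) = @_;
    my @words = split ' ', $line;
    return $line if @words < 2;
    my $gaps  = @words - 1;
    my $extra = $width - length(join ' ', @words);
    return join(' ', @words) if $extra <= 0;
    my @pad = (' ') x $gaps;
    for my $i (0 .. $extra - 1) {
        $pad[ $gaps - 1 - ($i % $gaps) ] .= ' ';   # last gap first
    }
    my $out = shift @words;
    $out .= shift(@pad) . $_ for @words;
    return $out;
}

print justify_full('Now is the Winter of our discontent made', 44), "\n";
print justify_full('glorious Summer by this son of York. And all', 44), "\n";
```
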
       Implicit centring

       Even if explicit centring is not specified, "autoformat" will attempt
       to automatically detect centred paragraphs and preserve their
       justification. It does this by examining each line of the paragraph
       and asking: "if this line were part of a centred paragraph, where would
       the centre line have been?"

       The answer can be determined by adding the length of leading whitespace
       before the first word, plus half the length of the full set of words on
       the line. That is, for a single line:

               $line =~ /^(\s*)(.*?)(\s*)$/;
               $centre = length($1)+0.5*length($2);

       By making the same estimate for every line, and then comparing the
       estimates, it is possible to deduce whether all the lines are centred
       with respect to the same axis of symmetry (with an allowance of ±1 to
       cater for the inevitable rounding when the centre positions of
       even-length rows were originally computed). If a common axis of
       symmetry is detected, "autoformat" assumes that the lines are supposed
       to be centred, and switches to centre-justification mode for that
       paragraph.

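       The comparison step might look like the following sketch. It applies
       the per-line estimate given above and, as an assumption of this
       example, reads the ±1 allowance as "every estimate lies within one
       column of the mean". The helper name "looks_centred" is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: estimate each line's centre and report whether all
# estimates agree to within one column of their mean.
sub looks_centred {
    my @centres;
    for my $line (@_) {
        next unless $line =~ /^(\s*)(.*?)(\s*)$/ and length $2;
        push @centres, length($1) + 0.5 * length($2);
    }
    return 0 if @centres < 2;
    my $mean = sum(@centres) / @centres;
    return (grep { abs($_ - $mean) > 1 } @centres) ? 0 : 1;
}

# Three lines centred in a 20-column field, and the same lines flush left.
my @centred = ((' ' x 7) . 'Title', (' ' x 3) . 'A longer line', (' ' x 9) . 'ab');
my @ragged  = ('Title', 'A longer line', 'ab');

print 'centred: ', looks_centred(@centred), "\n";
print 'ragged:  ', looks_centred(@ragged),  "\n";
```
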
       Note that this behaviour can be switched off entirely by setting the
       "autocentre" argument false.

       Case transformations

       The "autoformat" subroutine can also optionally perform case
       conversions on the text it processes. The "{case => type}" argument
       allows the user to specify five different conversions:

       'upper'
           This mode unconditionally converts every letter in the reformatted
           text to upper-case;

       'lower'
           This mode unconditionally converts every letter in the reformatted
           text to lower-case;

       'sentence'
           This mode attempts to generate correctly-cased sentences from the
           input text. That is, the first letter after a sentence-terminating
           punctuator is converted to upper-case. Then, each subsequent word
           in the sentence is converted to lower-case, unless that word is
           originally mixed-case or contains punctuation. For example, under
           "{case => 'sentence'}":

                   'POVERTY, MISERY, ETC. are the lot of the PhD candidate. alas!'

           becomes:

                   'Poverty, misery, etc. are the lot of the PhD candidate. Alas!'

           Note that "autoformat" is clever enough to recognize that the
           period after abbreviations such as "etc." is not a sentence
           terminator.

           If the argument is specified as 'sentence ' (with one or more
           trailing whitespace characters) those characters are used to
           replace the single space that appears at the end of the sentence.
           For example, "autoformat($text, {case=>'sentence  '})" (with two
           trailing spaces) would produce:

                   'Poverty, misery, etc. are the lot of the PhD candidate.  Alas!'

       'title'
           This mode behaves like 'sentence' except that the first letter of
           every word is capitalized:

                   'What I Did On My Summer Vacation In Monterey'

       'highlight'
           This mode behaves like 'title' except that trivial words are not
           capitalized:

                   'What I Did on my Summer Vacation in Monterey'

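       The difference between 'title' and 'highlight' comes down to a list
       of words that stay in lower case. The sketch below shows the idea
       with an assumed word list; both the list and the helper name
       "highlight_case" are illustrative, and the module's own notion of a
       "trivial" word may differ.

```perl
use strict;
use warnings;

# Assumed list of "trivial" words; chosen for illustration only.
my %trivial = map { $_ => 1 }
    qw(a an and as at but by for from in my of on or the to with);

# Hypothetical helper: capitalize every word except trivial ones, but
# always capitalize the first word (an assumption of this sketch).
sub highlight_case {
    my ($text) = @_;
    my $first = 1;
    $text =~ s{([\w']+)}{
        my $word = lc $1;
        my $out  = ($first || !$trivial{$word}) ? ucfirst($word) : $word;
        $first = 0;
        $out;
    }ge;
    return $text;
}

print highlight_case('what i did on my summer vacation in monterey'), "\n";
```
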
       Selective reformatting

       You can select which paragraphs "autoformat" actually reformats (or,
       rather, those it doesn't reformat) using the "ignore" flag.

       For example:

               # Reformat all paras except those containing "verbatim"...
               print autoformat { all => 1, ignore => qr/verbatim/i }, $text;

               # Reformat all paras except those less than 3 lines long...
               print autoformat { all => 1, ignore => sub { tr/\n/\n/ < 3 } }, $text;

               # Reformat all paras except those that are indented...
               print autoformat { all => 1, ignore => qr/^\s/m }, $text;

               # Reformat all paras except those that are indented (easier)...
               print autoformat { all => 1, ignore => 'indented' }, $text;

SEE ALSO

       The Text::Reform module

AUTHOR

       Damian Conway (damian@conway.org)

BUGS

       There are undoubtedly serious bugs lurking somewhere in code this funky
       :-) Bug reports and other feedback are most welcome.

COPYRIGHT

       Copyright (c) 1997-2000, Damian Conway. All Rights Reserved. This
       module is free software. It may be used, redistributed and/or modified
       under the terms of the Perl Artistic License (see
       http://www.perl.com/perl/misc/Artistic.html)

perl v5.8.8                       2005-05-04               Text::Autoformat(3)