HTML::WikiConverter::Dialects(3pm)

HTML::WikiConverter::Dialects(3)  User Contributed Perl Documentation  HTML::WikiConverter::Dialects(3)

NAME

       HTML::WikiConverter::Dialects - How to add a dialect

SYNOPSIS

         # In your dialect module:

         package HTML::WikiConverter::MySlimWiki;
         use base 'HTML::WikiConverter';

         sub rules { {
           b => { start => '**', end => '**' },
           i => { start => '//', end => '//' },
           strong => { alias => 'b' },
           em => { alias => 'i' },
           hr => { replace => "\n----\n" }
         } }

         # In a nearby piece of code:

         package main;
         use Test::More tests => 5;

         my $wc = HTML::WikiConverter->new(
           dialect => 'MySlimWiki'
         );

         is( $wc->html2wiki( '<b>text</b>' ), '**text**', 'b' );
         is( $wc->html2wiki( '<i>text</i>' ), '//text//', 'i' );
         is( $wc->html2wiki( '<strong>text</strong>' ), '**text**', 'strong' );
         is( $wc->html2wiki( '<em>text</em>' ), '//text//', 'em' );
         is( $wc->html2wiki( '<hr/>' ), '----', 'hr' );

DESCRIPTION

       HTML::WikiConverter (or H::WC, for short) is an HTML to wiki converter.
       It can convert HTML source into a variety of wiki markups, called wiki
       "dialects".  This manual describes how to create your own dialect to
       be plugged into HTML::WikiConverter.

DIALECTS

       Each dialect has a separate dialect module containing rules for
       converting HTML into wiki markup specific for that dialect. Currently,
       all dialect modules are in the "HTML::WikiConverter::" package space
       and subclass HTML::WikiConverter. For example, the MediaWiki dialect
       module is HTML::WikiConverter::MediaWiki, while PhpWiki's is
       HTML::WikiConverter::PhpWiki. However, dialect modules need not be in
       the "HTML::WikiConverter::" package space; you may just as easily use
       "package MyWikiDialect;" and H::WC will Do The Right Thing.

       From now on, I'll be using the terms "dialect" and "dialect module"
       interchangeably.

   Subclassing
       To interface with H::WC, dialects need to subclass it. This is done
       like so at the start of the dialect module:

         package HTML::WikiConverter::MySlimWiki;
         use base 'HTML::WikiConverter';

   Conversion rules
       Dialects guide H::WC's conversion process with a set of rules that
       define how HTML elements are turned into their wiki counterparts.  Each
       rule corresponds to an HTML tag and there may be any number of rules.
       Rules are specified in your dialect's "rules()" method, which returns a
       reference to a hash of rules. Each entry in the hash maps a tag name to
       a set of subrules, as in:

           $tag => \%subrules

       where $tag is the name of the HTML tag (e.g., "b", "em", etc.)  and
       %subrules contains subrules that specify how that tag will be converted
       when it is encountered in the HTML input.

       Subrules

       The following subrules are recognized:

         start
         end

         preserve
         attributes
         empty

         replace
         alias

         block
         line_format
         line_prefix

         trim

       A simple example

       The following rules could be used for a dialect that uses "*asterisks*"
       for bold and "_underscores_" for italic text:

         sub rules { {
           b => { start => '*', end => '*' },
           i => { start => '_', end => '_' },
         } }

       Aliases

       To add "<strong>" and "<em>" as aliases of "<b>" and "<i>", use the
       "alias" subrule:

         strong => { alias => 'b' },
         em => { alias => 'i' },

       (The "alias" subrule cannot be used with any other subrule.)

       Blocks

       Many dialects separate paragraphs and other block-level elements with a
       blank line. To indicate this, use the "block" subrule:

         p => { block => 1 },

       (To better support nested block elements, if block elements are nested
       inside each other, blank lines are only added to the outermost
       element.)

       Line formatting

       Many dialects require that the text of an element be contained on a
       single line of text, or that it cannot contain any newlines, etc. These
       options can be specified using the "line_format" subrule, which can be
       assigned the value "single", "multi", or "blocks".

       If the element must be contained on a single line, then the
       "line_format" subrule should be "single". If the element can span
       multiple lines, but there can be no blank lines contained within, then
       use "multi". If blank lines (which delimit blocks) are allowed, then
       use "blocks". For example, paragraphs are specified like so in the
       MediaWiki dialect:

         p => { block => 1, line_format => 'multi', trim => 'both' },

       Trimming whitespace

       The "trim" subrule specifies whether leading or trailing whitespace (or
       both) should be stripped from the element. To strip leading whitespace
       only, use "leading"; for trailing whitespace, use "trailing"; for both,
       use the aptly named "both"; for neither (the default), use "none".

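       The four "trim" values can be pictured with a small stand-alone
       sketch. The helper name "apply_trim" is hypothetical and this is not
       H::WC's own code; it only mirrors the behavior described above using
       core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of H::WC): apply a "trim" setting to text.
sub apply_trim {
    my ( $text, $trim ) = @_;
    $trim = 'none' unless defined $trim;    # "none" is the documented default
    $text =~ s/\A\s+// if $trim eq 'leading'  || $trim eq 'both';
    $text =~ s/\s+\z// if $trim eq 'trailing' || $trim eq 'both';
    return $text;
}

# Brackets make the remaining whitespace visible.
print '[', apply_trim( "  some text \n", 'both' ), "]\n";
```
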
       Line prefixes

       Some elements require that each line be prefixed with a particular
       string. This is specified with the "line_prefix" subrule. For example,
       preformatted text in MediaWiki is prefixed with a space:

         pre => { block => 1, line_prefix => ' ' },

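       As a rough illustration of what a line prefix does to multi-line
       content, the following core-Perl sketch prepends a string to every
       line. The helper name "apply_line_prefix" is an assumption made for
       this example, and H::WC's internal handling may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of H::WC): prefix every line of $text.
sub apply_line_prefix {
    my ( $text, $prefix ) = @_;
    # The -1 limit keeps empty fields, so blank lines are prefixed too.
    return join "\n", map { $prefix . $_ } split /\n/, $text, -1;
}

# A single space, as MediaWiki uses for preformatted text.
print apply_line_prefix( "line one\nline two", ' ' ), "\n";
```
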
       Replacement

       In some cases, conversion from HTML to wiki markup is as simple as
       string replacement. To replace a tag and its contents with a particular
       string, use the "replace" subrule. For example, in PhpWiki, three
       percent signs, "%%%", represent a line break, "<br>", hence:

         br => { replace => '%%%' },

       (The "replace" subrule cannot be used with any other subrule.)

       Preserving HTML tags

       Some dialects allow a subset of HTML in their markup. While H::WC
       ignores unhandled HTML tags by default (i.e., if H::WC encounters a tag
       that does not exist in a dialect's rule specification, then the
       contents of the tag are simply passed through to the wiki markup), you
       may specify that some be preserved using the "preserve" subrule. For
       example, to allow the "<font>" tag in wiki markup:

         font => { preserve => 1 },

       Preserved tags may also specify a list of attributes that are passed
       through from HTML to wiki markup. This is done with the "attributes"
       subrule:

         font => { preserve => 1, attributes => [ qw/ style class / ] },

       (The "attributes" subrule can only be used if the "preserve" subrule is
       also present.)

       Some HTML elements have no content (e.g., line breaks, images) and the
       wiki dialect might require them to be preserved in a more
       XHTML-friendly way. To indicate that a preserved tag should have no
       content, use the "empty" subrule. This will cause the element to be
       replaced with "<tag />" and no end tag. For example, MediaWiki handles
       line breaks like so:

         br => {
           preserve => 1,
           attributes => [ qw/ id class title style clear / ],
           empty => 1
         },

       This will convert, for example, "<br clear='both'>" into "<br
       clear='both' />". Without specifying the "empty" subrule, this would be
       converted into the (probably undesirable) "<br clear='both'></br>".

       (The "empty" subrule can only be used if the "preserve" subrule is also
       present.)

       Rules that depend on attribute values

       In some circumstances, you might want your dialect's conversion rules
       to depend on the value of one or more attributes. This can be achieved
       by producing rules in a conditional manner within "rules()". For
       example:

         sub rules {
           my $self = shift;

           my %rules = (
             em => { start => "''", end => "''" },
             strong => { start => "'''", end => "'''" },
           );

           $rules{i} = { preserve => 1 } if $self->preserve_italic;
           $rules{b} = { preserve => 1 } if $self->preserve_bold;

           return \%rules;
         }

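       To see the effect of such conditional rules without installing
       anything, here is a sketch that runs the same "rules()" body against a
       stand-in class. "My::FakeDialect" and its hand-written accessors are
       assumptions for illustration only; a real dialect would subclass
       HTML::WikiConverter and would declare "preserve_italic" and
       "preserve_bold" through its "attributes()" method.

```perl
use strict;
use warnings;

# Stand-in for a dialect object (hypothetical, does not use H::WC).
package My::FakeDialect;

sub new {
    my ( $class, %args ) = @_;
    return bless { preserve_italic => 0, preserve_bold => 0, %args }, $class;
}

sub preserve_italic { $_[0]{preserve_italic} }
sub preserve_bold   { $_[0]{preserve_bold} }

# Same shape as the rules() method shown above.
sub rules {
    my $self = shift;

    my %rules = (
        em     => { start => "''",  end => "''" },
        strong => { start => "'''", end => "'''" },
    );

    $rules{i} = { preserve => 1 } if $self->preserve_italic;
    $rules{b} = { preserve => 1 } if $self->preserve_bold;

    return \%rules;
}

package main;

# With preserve_italic enabled, an extra rule for "i" appears.
my $rules = My::FakeDialect->new( preserve_italic => 1 )->rules;
print join( ', ', sort keys %$rules ), "\n";
```
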
   Dynamic subrules
       Instead of simple strings, you may use coderefs as values for the
       "start", "end", "replace", and "line_prefix" subrules. If you do, the
       code will be called when the subrule is applied, and will be passed
       three arguments: the current H::WC object, the current HTML::Element
       node being operated on, and a reference to the hash containing the
       dialect's subrules associated with elements of that type.

       For example, MoinMoin handles lists like so:

         ul => { line_format => 'multi', block => 1, line_prefix => '  ' },
         li => { start => \&_li_start, trim => 'leading' },
         ol => { alias => 'ul' },

       It then defines "_li_start()":

         sub _li_start {
           my( $self, $node, $subrules ) = @_;
           my $bullet = '';
           $bullet = '*'  if $node->parent->tag eq 'ul';
           $bullet = '1.' if $node->parent->tag eq 'ol';
           return "\n$bullet ";
         }

       This prefixes every unordered list item with "*" and every ordered list
       item with "1.", which MoinMoin requires. It also puts each list item on
       its own line and places a space between the prefix and the content of
       the list item.

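       The calling convention for coderef subrules can be sketched in core
       Perl as follows. The helper "resolve_subrule" and the "My::FakeNode"
       class are hypothetical; the latter stands in for HTML::Element and
       implements only "tag" and "parent". This shows one plausible way to
       treat strings and coderefs uniformly, not H::WC's actual dispatch
       code.

```perl
use strict;
use warnings;

# Hypothetical helper: a subrule value is either a plain string or a
# coderef that receives ($wc, $node, $subrules).
sub resolve_subrule {
    my ( $wc, $node, $subrules, $name ) = @_;
    my $value = $subrules->{$name};
    return '' unless defined $value;
    return ref($value) eq 'CODE' ? $value->( $wc, $node, $subrules ) : $value;
}

# Minimal stand-in for an HTML::Element node.
package My::FakeNode;
sub new    { my ( $class, %args ) = @_; return bless {%args}, $class }
sub tag    { $_[0]{tag} }
sub parent { $_[0]{parent} }

package main;

# The MoinMoin-style callback from the text above.
sub _li_start {
    my ( $self, $node, $subrules ) = @_;
    my $bullet = '';
    $bullet = '*'  if $node->parent->tag eq 'ul';
    $bullet = '1.' if $node->parent->tag eq 'ol';
    return "\n$bullet ";
}

my %li = ( start => \&_li_start, trim => 'leading' );
my $ol = My::FakeNode->new( tag => 'ol' );
my $li = My::FakeNode->new( tag => 'li', parent => $ol );

print resolve_subrule( undef, $li, \%li, 'start' );        # coderef is called
print resolve_subrule( undef, $li, \%li, 'trim' ), "\n";   # plain string
```
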
   Subrule validation
       Certain subrule combinations are not allowed. Hopefully it's intuitive
       why this is, but in case it's not, prohibited combinations have been
       mentioned above parenthetically. For example, the "replace" and "alias"
       subrules cannot be combined with any other subrules, and "attributes"
       can only be specified alongside "preserve". Invalid subrule
       combinations will trigger a fatal error when the H::WC object is
       instantiated.

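       If you want to check a rule set yourself before handing it to H::WC,
       the documented constraints are easy to express in a few lines. The
       sketch below is a hypothetical checker ("check_subrules" is not an
       H::WC method); it returns a list of problems instead of dying, and it
       covers only the combinations named in this manual.

```perl
use strict;
use warnings;

# Hypothetical checker for the combinations documented above.
sub check_subrules {
    my ( $tag, $subrules ) = @_;
    my @problems;

    # "alias" and "replace" must each stand alone.
    for my $solo (qw( alias replace )) {
        next unless exists $subrules->{$solo};
        push @problems, "$tag: '$solo' cannot be combined with other subrules"
            if scalar( keys %$subrules ) > 1;
    }

    # "attributes" and "empty" are only valid alongside "preserve".
    for my $dependent (qw( attributes empty )) {
        next unless exists $subrules->{$dependent};
        push @problems, "$tag: '$dependent' requires 'preserve'"
            unless $subrules->{preserve};
    }

    return @problems;
}

my %rules = (
    strong => { alias      => 'b' },
    br     => { replace    => '%%%', block => 1 },
    font   => { attributes => [qw( style class )] },
);

for my $tag ( sort keys %rules ) {
    print "$_\n" for check_subrules( $tag, $rules{$tag} );
}
```

       Here "strong" passes, while "br" and "font" are each reported once.
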
   Dialect attributes
       H::WC's constructor accepts a number of attributes that help determine
       how conversion takes place. Dialects can alter these attributes or add
       their own by defining an "attributes()" method, which returns a
       reference to a hash of attributes. Each entry in the hash maps the
       attribute's name to an attribute specification, as in:

         $attr => \%spec

       where $attr is the name of the attribute and %spec is a
       Params::Validate specification for the attribute.

       For example, to add a boolean attribute called "camel_case" which is
       disabled by default:

         sub attributes { {
           camel_case => { default => 0 },
         } }

       Attributes defined like this are given accessor and mutator methods via
       Perl's "AUTOLOAD" mechanism, so you can later say:

         my $ok = $wc->camel_case;
         $wc->camel_case(0);

       You may override the default H::WC attributes using this mechanism. For
       example, while H::WC considers the "base_uri" attribute optional, it is
       required for the PbWiki dialect.  PbWiki can override this
       default-optional behavior by saying:

         sub attributes { {
           base_uri => { optional => 0 }
         } }

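       For readers unfamiliar with "AUTOLOAD"-based accessors, the following
       stand-alone sketch shows the general technique. The class
       "My::AttrDemo" and its %DEFAULTS table are assumptions made for this
       example; H::WC's real implementation validates attributes with
       Params::Validate and is not reproduced here.

```perl
use strict;
use warnings;

package My::AttrDemo;

our $AUTOLOAD;

# Attribute defaults, in the spirit of the attributes() method above.
my %DEFAULTS = ( camel_case => 0 );

sub new {
    my ( $class, %args ) = @_;
    return bless { %DEFAULTS, %args }, $class;
}

# Perl calls AUTOLOAD for any undefined method; here it serves as both
# accessor and mutator for known attributes.
sub AUTOLOAD {
    my $self = shift;
    ( my $name = $AUTOLOAD ) =~ s/.*:://;
    return if $name eq 'DESTROY';
    die "No such attribute: $name\n" unless exists $DEFAULTS{$name};
    $self->{$name} = shift if @_;
    return $self->{$name};
}

package main;

my $wc = My::AttrDemo->new;
print $wc->camel_case, "\n";    # accessor: reads the default
$wc->camel_case(1);             # mutator: stores a new value
print $wc->camel_case, "\n";
```
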
   Preprocessing
       The first step H::WC takes in converting HTML source to wiki markup is
       to parse the HTML into a syntax tree using HTML::TreeBuilder. It is
       often useful for dialects to preprocess the tree prior to converting it
       into wiki markup. Dialects that need to preprocess the tree can define
       a "preprocess_node" method that will be called on each node of the tree
       (traversal is done in pre-order). The method receives two arguments,
       the H::WC object, and the current HTML::Element node being traversed.
       It may modify the node or decide to ignore it; its return value is
       discarded.

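       Pre-order means a node is visited before any of its children. The
       sketch below demonstrates that order on a hypothetical tree of plain
       hashes; it does not use HTML::TreeBuilder, and "walk_preorder" is a
       name invented for this example.

```perl
use strict;
use warnings;

# Hypothetical tree: each node is a hash with a tag and a list of children.
my $tree = {
    tag      => 'html',
    children => [
        {
            tag      => 'body',
            children => [
                { tag => 'p',  children => [] },
                { tag => 'ul', children => [ { tag => 'li', children => [] } ] },
            ],
        },
    ],
};

# Visit the node itself first, then its children from left to right.
sub walk_preorder {
    my ( $node, $callback ) = @_;
    $callback->($node);
    walk_preorder( $_, $callback ) for @{ $node->{children} };
    return;
}

my @seen;
walk_preorder( $tree, sub { push @seen, $_[0]{tag} } );
print join( ' ', @seen ), "\n";
```

       A parent such as "ul" is therefore seen before its "li" children.
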
       Built-in preprocessors

       Because they are commonly needed, H::WC automatically carries out two
       preprocessing steps, regardless of the dialect: 1) relative URIs in
       images and links are converted to absolute URIs (based upon the
       "base_uri" parameter), and 2) ignorable text (e.g. between a "</td>"
       and "<td>") is discarded.

       H::WC also provides additional preprocessing steps that may be
       explicitly enabled by dialect modules.

       strip_aname
           Removes any anchor elements that do not contain an "href"
           attribute.

       caption2para
           Removes table captions and reinserts them as paragraphs before the
           table.

       Dialects may apply these optional preprocessing steps by calling them
       as methods on the dialect object inside "preprocess_node". For example:

         sub preprocess_node {
           my( $self, $node ) = @_;
           $self->strip_aname($node);
           $self->caption2para($node);
         }

   Postprocessing
       Once the work of converting HTML is complete, it is sometimes useful to
       postprocess the resulting wiki markup. Postprocessing can be used to
       clean up whitespace, fix subtle bugs introduced in the markup during
       conversion, etc.

       Dialects that want to postprocess the wiki markup should define a
       "postprocess_output" method that will be called just before the
       "html2wiki" method returns to the client. The method will be passed two
       arguments, the H::WC object and a reference to the wiki markup. The
       method may modify the wiki markup that the reference points to; its
       return value is discarded.

       For example, to replace a series of line breaks with a pair of
       newlines, a dialect might implement this:

         sub postprocess_output {
           my( $self, $outref ) = @_;
           $$outref =~ s/<br>\s*<br>/\n\n/gs;
         }

       (This example assumes that HTML line breaks were replaced with "<br>"
       in the wiki markup.)

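       Whitespace cleanup, the other use mentioned above, follows the same
       pattern of modifying the markup through the reference. This sketch
       uses the same argument order as "postprocess_output" but is written as
       a plain function named "tidy_output" (a hypothetical name) so that it
       runs without H::WC; the particular cleanup rules are only an example.

```perl
use strict;
use warnings;

# Same calling convention as postprocess_output: ($self, $outref).
# $self is unused, so the sketch can be called as a plain function.
sub tidy_output {
    my ( $self, $outref ) = @_;
    $$outref =~ s/[ \t]+$//mg;      # strip trailing spaces and tabs per line
    $$outref =~ s/\n{3,}/\n\n/g;    # collapse runs of blank lines into one
    return;
}

my $wiki = "= Title =  \n\n\n\nSome text\t\n";
tidy_output( undef, \$wiki );
print $wiki;
```
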
   Dialect utility methods
       H::WC defines a set of utility methods that dialect modules may find
       useful.

       get_elem_contents

         my $wiki = $wc->get_elem_contents( $node );

       Converts the contents of $node into wiki markup and returns the
       resulting wiki markup.

       get_wiki_page

         my $title = $wc->get_wiki_page( $url );

       Attempts to extract the title of a wiki page from the given URL,
       returning the title on success, "undef" on failure. If "wiki_uri" is
       empty, this method always returns "undef". See "ATTRIBUTES" in
       HTML::WikiConverter for details on how the "wiki_uri" attribute is
       interpreted.

       is_camel_case

         my $ok = $wc->is_camel_case( $str );

       Returns true if $str is in CamelCase, false otherwise. CamelCase-ness
       is determined using the same rules that Kwiki's formatting module uses.

       get_attr_str

         my $attr_str = $wc->get_attr_str( $node, @attrs );

       Returns a string containing the specified attributes in the given node.
       The returned string is suitable for insertion into an HTML tag.  For
       example, if $node contains the HTML

         <span id="ht" class="head" onclick="editPage()">Header</span>

       and @attrs contains "id" and "class", then "get_attr_str()" will return
       'id="ht" class="head"'.

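       The kind of string "get_attr_str()" produces can be imitated with a
       short core-Perl sketch. The function "attr_str" below is a
       hypothetical stand-in that reads attributes from a plain hash instead
       of an HTML::Element node and assumes the values need no escaping; it
       is not the H::WC implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_attr_str(): $attr_of is a hash reference
# of attribute names to values; only the requested names are emitted.
sub attr_str {
    my ( $attr_of, @names ) = @_;
    return join ' ',
        map  { sprintf '%s="%s"', $_, $attr_of->{$_} }
        grep { defined $attr_of->{$_} } @names;
}

# The attributes of the <span> element from the example above.
my %span = ( id => 'ht', class => 'head', onclick => 'editPage()' );
print attr_str( \%span, qw( id class ) ), "\n";
```
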
       _attr

         my $value = $wc->_attr( $name );

       Returns the value of the named attribute. This is rarely needed since
       you can access attribute values by treating the attribute name as a
       method (i.e., "$wc->$name"). This low-level method of accessing
       attributes is provided for when you need to override an attribute's
       accessor/mutator method, as in:

         sub attributes { {
           my_attr => { default => 1 },
         } }

         sub my_attr {
           my( $wc, $name, $new_value ) = @_;
           # do something special
           return $wc->_attr( $name => $new_value );
         }

AUTHOR

       David J. Iberri <diberri@cpan.org>

COPYRIGHT & LICENSE

       Copyright 2006 David J. Iberri, all rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.34.0                      2022-01-21  HTML::WikiConverter::Dialects(3)