HTML::WikiConverter::Dialects(3)  User Contributed Perl Documentation  HTML::WikiConverter::Dialects(3)

NAME

       HTML::WikiConverter::Dialects - How to add a dialect

SYNOPSIS

         # In your dialect module:

         package HTML::WikiConverter::MySlimWiki;
         use base 'HTML::WikiConverter';

         sub rules { {
           b => { start => '**', end => '**' },
           i => { start => '//', end => '//' },
           strong => { alias => 'b' },
           em => { alias => 'i' },
           hr => { replace => "\n----\n" }
         } }

         # In a nearby piece of code:

         package main;
         use Test::More tests => 5;

         my $wc = HTML::WikiConverter->new(
           dialect => 'MySlimWiki'
         );

         is( $wc->html2wiki( '<b>text</b>' ), '**text**', 'b' );
         is( $wc->html2wiki( '<i>text</i>' ), '//text//', 'i' );
         is( $wc->html2wiki( '<strong>text</strong>' ), '**text**', 'strong' );
         is( $wc->html2wiki( '<em>text</em>' ), '//text//', 'em' );
         is( $wc->html2wiki( '<hr/>' ), '----', 'hr' );

DESCRIPTION

       HTML::WikiConverter (or H::WC, for short) is an HTML to wiki converter.
       It can convert HTML source into a variety of wiki markups, called wiki
       "dialects".  This manual describes how to create your own dialect to
       be plugged into HTML::WikiConverter.

DIALECTS

       Each dialect has a separate dialect module containing rules for
       converting HTML into wiki markup specific to that dialect. Currently,
       all dialect modules are in the "HTML::WikiConverter::" package space
       and subclass HTML::WikiConverter. For example, the MediaWiki dialect
       module is HTML::WikiConverter::MediaWiki, while PhpWiki's is
       HTML::WikiConverter::PhpWiki. However, dialect modules need not be in
       the "HTML::WikiConverter::" package space; you may just as easily use
       "package MyWikiDialect;" and H::WC will Do The Right Thing.

       From now on, I'll be using the terms "dialect" and "dialect module"
       interchangeably.

   Subclassing
       To interface with H::WC, dialects need to subclass it. This is done
       like so at the start of the dialect module:

         package HTML::WikiConverter::MySlimWiki;
         use base 'HTML::WikiConverter';

   Conversion rules
       Dialects guide H::WC's conversion process with a set of rules that
       define how HTML elements are turned into their wiki counterparts.  Each
       rule corresponds to an HTML tag and there may be any number of rules.
       Rules are specified in your dialect's rules() method, which returns a
       reference to a hash of rules. Each entry in the hash maps a tag name to
       a set of subrules, as in:

           $tag => \%subrules

       where $tag is the name of the HTML tag (e.g., "b", "em", etc.)  and
       %subrules contains subrules that specify how that tag will be converted
       when it is encountered in the HTML input.

       Subrules

       The following subrules are recognized:

         start
         end

         preserve
         attributes
         empty

         replace
         alias

         block
         line_format
         line_prefix

         trim

       A simple example

       The following rules could be used for a dialect that uses "*asterisks*"
       for bold and "_underscores_" for italic text:

         sub rules {
           return {
             b => { start => '*', end => '*' },
             i => { start => '_', end => '_' },
           };
         }

       Aliases

       To add "<strong>" and "<em>" as aliases of "<b>" and "<i>", use the
       "alias" subrule:

         strong => { alias => 'b' },
         em => { alias => 'i' },

       (The "alias" subrule cannot be used with any other subrule.)

       Blocks

       Many dialects separate paragraphs and other block-level elements with a
       blank line. To indicate this, use the "block" subrule:

         p => { block => 1 },

       (To better support nested block elements, if block elements are nested
       inside each other, blank lines are only added to the outermost
       element.)

       Line formatting

       Many dialects require that the text of an element be contained on a
       single line of text, or that it cannot contain any newlines, etc. These
       options can be specified using the "line_format" subrule, which can be
       assigned the value "single", "multi", or "blocks".

       If the element must be contained on a single line, then the
       "line_format" subrule should be "single". If the element can span
       multiple lines, but there can be no blank lines contained within, then
       use "multi". If blank lines (which delimit blocks) are allowed, then
       use "blocks". For example, paragraphs are specified like so in the
       MediaWiki dialect:

         p => { block => 1, line_format => 'multi', trim => 'both' },

       Trimming whitespace

       The "trim" subrule specifies whether leading or trailing whitespace (or
       both) should be stripped from the element. To strip leading whitespace
       only, use "leading"; for trailing whitespace, use "trailing"; for both,
       use the aptly named "both"; for neither (the default), use "none".
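
       The four "trim" values can be pictured with a small standalone
       sketch. The helper apply_trim() below is hypothetical and is not
       part of H::WC; it only illustrates the effect each value is described
       as having on an element's text.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of H::WC: applies one of the four
# "trim" modes described above to a piece of text.
sub apply_trim {
  my ( $text, $mode ) = @_;
  $mode = 'none' unless defined $mode;    # "none" is the documented default
  $text =~ s/^\s+// if $mode eq 'leading'  or $mode eq 'both';
  $text =~ s/\s+$// if $mode eq 'trailing' or $mode eq 'both';
  return $text;
}

# Show the same input under each mode, bracketed to make whitespace visible.
print '[', apply_trim( "  some text \n", $_ ), "]\n"
  for qw( none leading trailing both );
```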

       Line prefixes

       Some elements require that each line be prefixed with a particular
       string. This is specified with the "line_prefix" subrule. For example,
       preformatted text in MediaWiki is prefixed with a space:

         pre => { block => 1, line_prefix => ' ' },
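
       As a rough illustration of what prefixing each line means, the
       standalone sketch below uses a hypothetical helper,
       apply_line_prefix(), which is not part of H::WC. How H::WC itself
       treats empty lines or trailing newlines is not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of H::WC: prepends $prefix to every
# line of $text, which is the effect the "line_prefix" subrule describes.
sub apply_line_prefix {
  my ( $text, $prefix ) = @_;
  $text =~ s/^/$prefix/gm;    # /m lets ^ match at the start of each line
  return $text;
}

print apply_line_prefix( "first line\nsecond line", ' ' ), "\n";
```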

       Replacement

       In some cases, conversion from HTML to wiki markup is as simple as
       string replacement. To replace a tag and its contents with a particular
       string, use the "replace" subrule. For example, in PhpWiki, three
       percent signs, "%%%", represent a line break, "<br>", hence:

         br => { replace => '%%%' },

       (The "replace" subrule cannot be used with any other subrule.)

       Preserving HTML tags

       Some dialects allow a subset of HTML in their markup. While H::WC
       ignores unhandled HTML tags by default (i.e., if H::WC encounters a tag
       that does not exist in a dialect's rule specification, then the
       contents of the tag are simply passed through to the wiki markup), you
       may specify that some be preserved using the "preserve" subrule. For
       example, to allow the "<font>" tag in wiki markup:

         font => { preserve => 1 },

       Preserved tags may also specify a list of attributes that are passed
       through from HTML to wiki markup. This is done with the "attributes"
       subrule:

         font => { preserve => 1, attributes => [ qw/ style class / ] },

       (The "attributes" subrule can only be used if the "preserve" subrule is
       also present.)

       Some HTML elements have no content (e.g., line breaks, images) and the
       wiki dialect might require them to be preserved in a more
       XHTML-friendly way. To indicate that a preserved tag should have no
       content, use the "empty" subrule. This will cause the element to be
       replaced with "<tag />" and no end tag. For example, MediaWiki handles
       line breaks like so:

         br => {
           preserve => 1,
           attributes => [ qw/ id class title style clear / ],
           empty => 1
         },

       This will convert, for example, "<br clear='both'>" into "<br
       clear='both' />". Without specifying the "empty" subrule, this would be
       converted into the (probably undesirable) "<br clear='both'></br>".

       (The "empty" subrule can only be used if the "preserve" subrule is also
       present.)

       Rules that depend on attribute values

       In some circumstances, you might want your dialect's conversion rules
       to depend on the value of one or more attributes. This can be achieved
       by producing rules in a conditional manner within rules(). For example:

         sub rules {
           my $self = shift;

           my %rules = (
             em => { start => "''", end => "''" },
             strong => { start => "'''", end => "'''" },
           );

           $rules{i} = { preserve => 1 } if $self->preserve_italic;
           $rules{b} = { preserve => 1 } if $self->preserve_bold;

           return \%rules;
         }
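
       To see the conditional in action without installing anything, the
       sketch below swaps the dialect for a hypothetical stand-in class,
       My::StandIn, with a hand-written preserve_italic accessor. A real
       dialect would instead subclass HTML::WikiConverter and would
       presumably declare preserve_italic through attributes(), as described
       under "Dialect attributes".

```perl
use strict;
use warnings;

# Stand-in for a dialect object (an assumption for illustration only):
# a real dialect would subclass HTML::WikiConverter instead.
package My::StandIn;

sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub preserve_italic { $_[0]{preserve_italic} }

sub rules {
  my $self  = shift;
  my %rules = ( em => { start => "''", end => "''" } );

  # The <i> rule is only produced when the attribute is enabled.
  $rules{i} = { preserve => 1 } if $self->preserve_italic;
  return \%rules;
}

package main;

for my $flag ( 0, 1 ) {
  my $rules = My::StandIn->new( preserve_italic => $flag )->rules;
  print "preserve_italic=$flag: ", join( ', ', sort keys %$rules ), "\n";
}
```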

   Dynamic subrules
       Instead of simple strings, you may use coderefs as values for the
       "start", "end", "replace", and "line_prefix" subrules. If you do, the
       code will be called when the subrule is applied, and will be passed
       three arguments: the current H::WC object, the current HTML::Element
       node being operated on, and a reference to the hash containing the
       dialect's subrules associated with elements of that type.

       For example, MoinMoin handles lists like so:

         ul => { line_format => 'multi', block => 1, line_prefix => '  ' },
         li => { start => \&_li_start, trim => 'leading' },
         ol => { alias => 'ul' },

       It then defines _li_start():

         sub _li_start {
           my( $self, $node, $subrules ) = @_;
           my $bullet = '';
           $bullet = '*'  if $node->parent->tag eq 'ul';
           $bullet = '1.' if $node->parent->tag eq 'ol';
           return "\n$bullet ";
         }

       This prefixes every unordered list item with "*" and every ordered list
       item with "1.", which MoinMoin requires. It also puts each list item on
       its own line and places a space between the prefix and the content of
       the list item.
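
       The string-or-coderef choice can be sketched as follows. The helper
       resolve_subrule() is hypothetical and is not H::WC's own code; it
       only shows one way a subrule value might be resolved, passing the
       three arguments listed above. Plain strings stand in for the
       converter object and the node.

```perl
use strict;
use warnings;

# Hypothetical helper, not H::WC's own code: returns a subrule's value,
# calling it first if it is a coderef.
sub resolve_subrule {
  my ( $wc, $node, $subrules, $name ) = @_;
  my $value = $subrules->{$name};
  return ref($value) eq 'CODE'
    ? $value->( $wc, $node, $subrules )
    : $value;
}

my %static  = ( start => '**' );
my %dynamic = ( start => sub { my ( $wc, $node ) = @_; return "[$node] " } );

# 'wc' and 'li' are placeholder strings, not real objects.
print resolve_subrule( 'wc', 'li', \%static,  'start' ), "\n";
print resolve_subrule( 'wc', 'li', \%dynamic, 'start' ), "\n";
```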

   Subrule validation
       Certain subrule combinations are not allowed. Hopefully it's intuitive
       why this is, but in case it's not, prohibited combinations have been
       mentioned above parenthetically. For example, the "replace" and "alias"
       subrules cannot be combined with any other subrules, and "attributes"
       can only be specified alongside "preserve". Invalid subrule
       combinations will trigger a fatal error when the H::WC object is
       instantiated.
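
       The constraints stated in this manual can be collected into a small
       standalone check. The function validate_subrules() below is a
       hypothetical sketch, not H::WC's implementation; its error messages
       are made up, and H::WC may enforce further combinations that are not
       listed here.

```perl
use strict;
use warnings;

# Hypothetical validator, not H::WC's implementation: enforces only the
# combinations stated in this manual.
sub validate_subrules {
  my ( $tag, $subrules ) = @_;
  my @names = keys %$subrules;

  # "alias" and "replace" must each be the only subrule for a tag.
  for my $exclusive (qw( alias replace )) {
    die "'$exclusive' cannot be combined with other subrules in '$tag'\n"
      if exists $subrules->{$exclusive} and @names > 1;
  }

  # "attributes" and "empty" only make sense on preserved tags.
  for my $dependent (qw( attributes empty )) {
    die "'$dependent' requires 'preserve' in '$tag'\n"
      if exists $subrules->{$dependent} and not exists $subrules->{preserve};
  }
  return 1;
}

validate_subrules( br => { preserve => 1, empty => 1 } );    # passes
eval { validate_subrules( em => { alias => 'i', trim => 'both' } ) };
print "rejected: $@" if $@;
```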

   Dialect attributes
       H::WC's constructor accepts a number of attributes that help determine
       how conversion takes place. Dialects can alter these attributes or add
       their own by defining an attributes() method, which returns a reference
       to a hash of attributes. Each entry in the hash maps the attribute's
       name to an attribute specification, as in:

         $attr => \%spec

       where $attr is the name of the attribute and %spec is a
       Params::Validate specification for the attribute.

       For example, to add a boolean attribute called "camel_case" which is
       disabled by default:

         sub attributes { {
           camel_case => { default => 0 },
         } }

       Attributes defined like this are given accessor and mutator methods via
       Perl's "AUTOLOAD" mechanism, so you can later say:

         my $ok = $wc->camel_case;
         $wc->camel_case(0);

       You may override the default H::WC attributes using this mechanism. For
       example, while H::WC considers the "base_uri" attribute optional, it is
       required for the PbWiki dialect.  PbWiki can override this
       default-optional behavior by saying:

         sub attributes { {
           base_uri => { optional => 0 }
         } }

   Preprocessing
       The first step H::WC takes in converting HTML source to wiki markup is
       to parse the HTML into a syntax tree using HTML::TreeBuilder. It is
       often useful for dialects to preprocess the tree prior to converting it
       into wiki markup. Dialects that need to preprocess the tree can define
       a "preprocess_node" method that will be called on each node of the tree
       (traversal is done in pre-order). The method receives two arguments,
       the H::WC object, and the current HTML::Element node being traversed.
       It may modify the node or decide to ignore it; its return value is
       discarded.

       Built-in preprocessors

       Because they are commonly needed, H::WC automatically carries out two
       preprocessing steps, regardless of the dialect: 1) relative URIs in
       images and links are converted to absolute URIs (based upon the
       "base_uri" parameter), and 2) ignorable text (e.g. between a "</td>"
       and "<td>") is discarded.

       H::WC also provides additional preprocessing steps that may be
       explicitly enabled by dialect modules.

       strip_aname
           Removes any anchor elements that do not contain an "href"
           attribute.

       caption2para
           Removes table captions and reinserts them as paragraphs before the
           table.

       Dialects may apply these optional preprocessing steps by calling them
       as methods on the dialect object inside "preprocess_node". For example:

         sub preprocess_node {
           my( $self, $node ) = @_;
           $self->strip_aname($node);
           $self->caption2para($node);
         }

   Postprocessing
       Once the work of converting HTML is complete, it is sometimes useful to
       postprocess the resulting wiki markup. Postprocessing can be used to
       clean up whitespace, fix subtle bugs introduced in the markup during
       conversion, etc.

       Dialects that want to postprocess the wiki markup should define a
       "postprocess_output" method that will be called just before the
       "html2wiki" method returns to the client. The method will be passed two
       arguments, the H::WC object and a reference to the wiki markup. The
       method may modify the wiki markup that the reference points to; its
       return value is discarded.

       For example, to replace a series of line breaks with a pair of
       newlines, a dialect might implement this:

         sub postprocess_output {
           my( $self, $outref ) = @_;
           $$outref =~ s/<br>\s*<br>/\n\n/gs;
         }

       (This example assumes that HTML line breaks were replaced with "<br>"
       in the wiki markup.)
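
       Because the method only touches the string behind the reference, it
       can be tried on its own. The sketch below repeats the sub from the
       example and calls it as a plain function, with "undef" standing in
       for the H::WC object (an assumption made purely for demonstration).

```perl
use strict;
use warnings;

# Same body as the example above; $self is unused, so the sub can be
# exercised without constructing a converter.
sub postprocess_output {
  my ( $self, $outref ) = @_;
  $$outref =~ s/<br>\s*<br>/\n\n/gs;
}

my $wiki = "first paragraph<br>\n<br>second paragraph";
postprocess_output( undef, \$wiki );    # modifies $wiki in place
print $wiki, "\n";
```

       Note that this pattern consumes line breaks two at a time, so a run of
       three "<br>" tags leaves one "<br>" behind. A quantified group such as
       "(?:<br>\s*){2,}" would absorb the whole run, along with any
       whitespace after the last tag.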

   Dialect utility methods
       H::WC defines a set of utility methods that dialect modules may find
       useful.

       get_elem_contents

         my $wiki = $wc->get_elem_contents( $node );

       Converts the contents of $node into wiki markup and returns the
       resulting wiki markup.

       get_wiki_page

         my $title = $wc->get_wiki_page( $url );

       Attempts to extract the title of a wiki page from the given URL,
       returning the title on success, "undef" on failure. If "wiki_uri" is
       empty, this method always returns "undef". See "ATTRIBUTES" in
       HTML::WikiConverter for details on how the "wiki_uri" attribute is
       interpreted.

       is_camel_case

         my $ok = $wc->is_camel_case( $str );

       Returns true if $str is in CamelCase, false otherwise. CamelCase-ness
       is determined using the same rules that Kwiki's formatting module uses.

       get_attr_str

         my $attr_str = $wc->get_attr_str( $node, @attrs );

       Returns a string containing the specified attributes in the given node.
       The returned string is suitable for insertion into an HTML tag.  For
       example, if $node contains the HTML

         <span id="ht" class="head" onclick="editPage()">Header</span>

       and @attrs contains "id" and "class", then get_attr_str() will return
       'id="ht" class="head"'.
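
       The selection of attributes can be mimicked in a few lines. The
       function attr_str() below is a hypothetical stand-in, not H::WC's
       get_attr_str(): it takes a plain hash of attribute values rather than
       an HTML::Element node and performs no escaping.

```perl
use strict;
use warnings;

# Hypothetical stand-in, not H::WC's get_attr_str(): builds an attribute
# string from a plain hash, keeping only the requested names, in order.
sub attr_str {
  my ( $attrs, @names ) = @_;
  return join ' ',
    map  { sprintf '%s="%s"', $_, $attrs->{$_} }
    grep { defined $attrs->{$_} } @names;
}

# The attributes of the <span> element shown above.
my %span = ( id => 'ht', class => 'head', onclick => 'editPage()' );
print attr_str( \%span, qw( id class ) ), "\n";
```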

       _attr

         my $value = $wc->_attr( $name );

       Returns the value of the named attribute. This is rarely needed since
       you can access attribute values by treating the attribute name as a
       method (i.e., "$wc->$name"). This low-level method of accessing
       attributes is provided for when you need to override an attribute's
       accessor/mutator method, as in:

         sub attributes { {
           my_attr => { default => 1 },
         } }

         sub my_attr {
           my( $wc, $name, $new_value ) = @_;
           # do something special
           return $wc->_attr( $name => $new_value );
         }

AUTHOR

       David J. Iberri <diberri@cpan.org>

COPYRIGHT

       Copyright 2006 David J. Iberri, all rights reserved.

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

perl v5.38.0                      2023-07-20  HTML::WikiConverter::Dialects(3)